Preface


This book contains refereed papers presented at the Seventh IEEE Workshop on
Neural Networks for Signal Processing (NNSP ’97) held at the Amelia Island
Plantation, Amelia Island, Florida on September 24-26, 1997.

The  Neural  Networks  Technical  Committee  of  the  IEEE  Signal  Processing  Society 
sponsored  NNSP  ’97,  in  cooperation  with  the  IEEE  Neural  Network  Council  and 
with  co-sponsorship  from  the  Air  Force  Office  of  Scientific  Research  (AFOSR). 
We  designed  the  workshop  to  serve  as  a  regular  forum  for  researchers  in  academia 
and  industry  who  are  interested  in  the  exciting  field  of  neural  networks  for  signal 
processing. Neural networks offer a fresh view of the important problems faced in
signal processing because they extend linear models and go beyond the assumptions
of stationarity and Gaussianity traditionally imposed in signal processing.

This  year  we  announced  two  topics  in  the  call  for  papers.  The  goal  was  to  create  a 
critical  mass  of  submissions  and  dedicate  a  full  session  to  discuss  a  topic  of  current 
interest.  This  year’s  topics  are  blind  signal  processing  and  biomedical  applications. 
Each is important in its own right. Blind signal processing is a difficult but exciting
area of signal processing with many practical applications for which the use of
nonlinearity is key for acceptable solutions. The biomedical area has long been a
challenging area due to the imprecise nature of human reasoning and the need for
more sophisticated quantitative tools. Neural networks and other approximate reasoning
methods are key players in this effort. We hope that this approach of electing
topics will be successful and will make these proceedings a necessary
reference for the advances reported in each field.

Our  deep  appreciation  is  extended  to  Drs.  Simon  Haykin,  S.Y.  Kung,  J.F.  Cardoso, 
Yann  LeCun  and  David  Brown  for  their  insightful  plenary  talks.  Our  sincere 
thanks  go  to  all  members  of  the  Technical  Committee  for  the  excellent  and  timely 
reviews,  and  above  all  to  the  authors  whose  contributions  made  this  workshop 
possible. 

Continuing  with  the  tradition  of  paperless  communication,  this  year's  reviews  and 
announcements were all electronic. Thanks to Dong-Wei Chen and Craig Fancourt
for  keeping  the  NNSP  ’97  Web  page  (http://www.cnel.ufl.edu/nnsp97/)  current 
and  effective.  Special  thanks  go  to  Ms.  Sharon  Bosarge  for  her  dedication  and  hard 
work  to  coordinate  the  many  details  necessary  to  put  together  the  program  and  the 
local  arrangements. 
THEORY


Invited  Lecture 


Chaos,  radar  clutter,  and  neural  networks 

Simon  Haykin 

McMaster  University 
Ontario,  Canada 

The lecture will be in three parts dealing in a coordinated way with the issues of
chaos and neural networks. In the first part of the lecture, I will outline the set of
criteria for determining if a given experimental (physical) time series is indeed chaotic.
The criteria include: tests for nonlinearity, reliable estimations of the correlation
dimension, characteristic time delay, embedding dimension, and the Liapunov
spectrum.

In  the  second  part  of  the  lecture,  I  will  present  a  case  study  based  on  real  life  data 
of  sea  clutter  (i.e.,  radar  backscatter  from  an  ocean  surface)  and  demonstrate  how 
the  above-mentioned  criteria  are  satisfied  in  a  very  convincing  way.  The  third  part 
of the lecture addresses how a neural network can be used to perform dynamic
reconstruction  on  an  experimental  time  series  known  to  be  chaotic.  The  problem  is 
usually  complicated  by  the  unavoidable  presence  of  additive  noise.  For  this  part  of 
the  study,  I  will  present  the  results  of  a  detailed  study  involving  the  following 
learning  algorithms: 

Regularized  radial  basis  function  network. 

Support  vector  machine. 

The  results  of  these  evaluations  will  be  checked  against  the  chaos  theory  described 
in  the  first  part  of  the  lecture. 
NONPARAMETRIC REGRESSION
MODELING  WITH  TOPOGRAPHIC  MAPS 
AS  A  BASIS  FOR 
LOSSY  IMAGE  COMPRESSION 


Marc M. Van Hulle

Laboratorium  voor  Neuro-  en  Psychofysiologie 
K.U.  Leuven 
Campus  Gasthuisberg 
Herestraat 

B-3000  Leuven,  BELGIUM 
Tel.:  +  32  16  34  59  61  Fax:  +  32  16  34  59  93 
E-mail: marc@neuro.kuleuven.ac.be


Abstract - We introduce a new approach to lossy image compression
with topographic maps based on nonparametric regression
modeling: the topographic maps are trained to perform nonparametric
regression using our recently introduced learning rule,
called the Maximum Entropy learning Rule [9, 10], in combination
with projection pursuit regression learning [11]. Furthermore, in
order to better account for the local image statistics, we apply
a technique similar to subspace classification. Finally, we compare
the performance of our approach to that of the Karhunen-Loeve
transform and the optimally integrated adaptive learning
algorithm [7].


INTRODUCTION 

Kohonen’s  Self-Organizing  (feature)  Map  (SOM)  algorithm  [1]  is  aimed  at 
establishing,  in  an  unsupervised  way,  a  mapping  from  a  higher  dimensional 
space  of  input  signals  onto  an  equal  or  lower  dimensional  discrete  lattice  of 
formal  neurons.  The  algorithm  was  originally  conceived  for  nonparametric 
regression  analysis,  whereby  the  converged  topographic  map  was  intended  to 
capture the main dimensions of the input space ([2], p. 152). Hence, the algorithm
can be used for the regression modeling of multi-variable functions, at
least in principle, since it often yields “nonfunctional” mappings (one input
can map onto more than one output, see [3, 4]). In addition to nonparametric
regression, there also exists an intimate connection with the classic
k-means clustering algorithm [12, 13] and the LBG algorithm [14] for building
vector quantizers (except for the neighborhood function, see [15]). Hence,
depending upon the interpretation adopted, the SOM algorithm performs
regression modeling or vector quantization (VQ). In the VQ case, the SOM
algorithm can be used as a “lossy” compression technique: the converged
neuron weights form an optimal set of codewords (optimal with respect to
the mean squared error (MSE) distortion due to quantization) and the minimum
Euclidean distance between an input sample and the neuron weights
defines the code membership function. Such a VQ-based coding is radically
different from a (linear) transform coding in which each input sample is projected
along a limited number of projection directions (limited compared to
the dimensionality of the input space). The optimal linear transformation
is the Karhunen-Loeve transform (KLT) since it minimizes the MSE [5] or,
equivalently for this technique, since it maximizes the norm of the projected
input vector.
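
As a minimal sketch of the VQ interpretation described above, the following Python fragment encodes an input sample by the index of its nearest codeword (minimum Euclidean distance) and decodes it by table lookup. The codebook values and the function names `encode` and `decode` are illustrative assumptions and are not taken from the paper.

```python
import math

def encode(sample, codebook):
    """Return the index of the codeword nearest to `sample` (Euclidean distance)."""
    distances = [math.dist(sample, codeword) for codeword in codebook]
    return distances.index(min(distances))

def decode(index, codebook):
    """Replace a stored index by its codeword (this is where the loss occurs)."""
    return codebook[index]

# Hypothetical codebook of four 2-D codewords (e.g. converged neuron weights).
codebook = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
index = encode((0.9, 0.2), codebook)
reconstruction = decode(index, codebook)
```

Only the index needs to be stored, which is what makes the scheme a compression technique; the reconstruction error is the distance between the sample and its codeword.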

The assumptions upon which the optimality conditions are based can be
questioned; specifically, the use of global statistics for generating an optimal
coding scheme may not be appropriate when the input distribution is not
stationary. In an attempt to remedy this problem, and to account for the
local statistics as well, several adaptive techniques have been devised, also
in the neural network field. The SOM algorithm was used as a basis of
subspace classification [6]; the resulting structure is referred to as the adaptive
subspace self-organizing algorithm [2]. More recently, Dony and Haykin
[7]  proposed  an  adaptive  scheme,  which  favorably  combined  VQ  with  KLT, 
called  the  mixture  of  principal  components  (MPC).  Specifically,  it  partitions 
the  input  distribution  into  a  number  of  regions  or  classes  as  in  VQ.  Within 
each  class,  an  input  vector  is  represented  by  a  linear  combination  of  a  limited 
number  of  basis  vectors  which  define  a  subspace  in  a  manner  analogous  to  a 
principal  components  representation.  The  principal  components  are  obtained 
with  a  neural  network  learning  rule,  such  as  Sanger’s  [8]  or  Oja’s  [6],  since 
such  an  iterative  approach  to  KLT  requires  less  storage  overhead  and  can  be 
computationally  more  efficient  than  the  algebraic  approaches  which  operate 
on  the  sample  covariance  matrix  directly  [7].  The  learning  scheme  as  a  whole 
is  called  the  optimally  integrated  adaptive  learning  (OIAL)  algorithm. 

In  this  paper,  we  will  introduce  a  novel  way  to  perform  image  compression 
with  neural  networks:  we  will  use  topographic  maps,  trained  for  regression 
purposes,  which  become  part  of  a  technique  similar  to  subspace  classification. 
The  topographic  maps  are  trained  with  our  recently  introduced  rule,  called 
the  Maximum  Entropy  learning  Rule  (MER)  [9,  10],  in  combination  with 
projection pursuit regression learning [11].


COMBINING  PROJECTION  PURSUIT 
REGRESSION  WITH  MER 

Consider the regression fitting of a scalar function $y$ of $d - 1$ independent
variables, denoted by the vector $\mathbf{x} = [x_1, \ldots, x_{d-1}]$, from a given set of $M$
possibly noisy data points or measurements $\{(\mathbf{x}^m, y^m), m = 1, \ldots, M\}$ in
$d$-dimensional space:

$$y^m = f(\mathbf{x}^m) + \mathrm{noise}, \tag{1}$$

with $f$ the unknown function to be estimated, and where the noise contribution
has zero mean and is independent from the $\{\mathbf{x}^m\}$. In projection pursuit
regression (PPR) [11], the $d$-dimensional data points are interpreted through
optimally-chosen lower dimensional projections; the “pursuit” part refers to
optimization with respect to these projection directions. We will limit ourselves
to the case where the function $f$ is approximated by a sum of scalar
functions $f_k$:

$$\hat{f}(\mathbf{x}) = \sum_{k=1}^{K} f_k(\mathbf{a}_k \mathbf{x}^T), \tag{2}$$

with $\hat{f}$ the approximation of $f$, and $\mathbf{a}_k$ the projection directions (unit vectors)
and where $T$ stands for transpose. The $f_k$ are piecewise smooth activation
functions or splines that join continuously at points called “knots.” The functions
$f_k$ and projections $\mathbf{a}_k$ are estimated sequentially in order to minimize
the mean squared error (MSE) of the residuals:


$$C(\mathbf{a}_k) = \frac{1}{M} \sum_{m=1}^{M} \left[ f_k(\mathbf{a}_k \mathbf{x}^{mT}) - \left\{ y^m - \sum_{k'=1}^{k-1} f_{k'}(\mathbf{a}_{k'} \mathbf{x}^{mT}) \right\} \right]^2, \tag{3}$$

and where the term between the curly brackets denotes the $k$th residual
of the $m$th data point, $r_k^m$, with $r_1^m = y^m$. In other words, each newly-added
projection $\mathbf{a}_k$ and activation function $f_k$ are developed in the space
of the residuals $r_k^m$ which remain unaccounted for by the sum of the $k - 1$
previously-determined activation functions.
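
The sequential fitting behind eq. (3) can be sketched as follows. The fragment computes the kth residuals by subtracting the contributions of the previously fitted terms and then evaluates the cost of a candidate projection and activation function. The toy data and the ridge functions used here are hypothetical callables chosen for illustration; they are not the spline-based activation functions of the paper.

```python
def project(a, x):
    """Inner product of projection direction `a` with data vector `x`."""
    return sum(ai * xi for ai, xi in zip(a, x))

def residuals(xs, ys, fitted):
    """k-th residuals: y^m minus the sum of the previously fitted terms.

    `fitted` is a list of (a, f) pairs; with an empty list the residual is y^m.
    """
    return [y - sum(f(project(a, x)) for a, f in fitted) for x, y in zip(xs, ys)]

def cost(a, f, xs, res):
    """Mean squared error between f(a . x^m) and the current residuals, cf. eq. (3)."""
    return sum((f(project(a, x)) - r) ** 2 for x, r in zip(xs, res)) / len(xs)

# Toy data generated by y = 2*x1 + x2**2 (noise-free, for illustration only).
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (1.0, 2.0)]
ys = [2 * x1 + x2 ** 2 for x1, x2 in xs]

first = ((1.0, 0.0), lambda s: 2 * s)    # hypothetical first fitted term
r2 = residuals(xs, ys, [first])          # what remains for the second term
c2 = cost((0.0, 1.0), lambda s: s ** 2, xs, r2)
```

In this toy case the second term explains the residuals exactly, so the cost drops to zero; with real data each added term only reduces what the earlier terms left unexplained.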

If we consider the knots as neurons of a one-dimensional topographic map,
then we can determine the position of these knots adaptively by developing
the map in the two-dimensional space $V$ of the inputs $\{(\mathbf{a}_k \mathbf{x}^{mT}), r_k^m\}$. As an
example, consider the one-dimensional lattice shown in Fig. 1A. The neuron
weights $\mathbf{w}_i = [w_{i1}, w_{i2}]$ will be updated with our previously developed learning
rule, called the Maximum Entropy learning Rule (MER) [9, 10]. Before
we introduce our rule, we first consider our definition of quantization region,
which is quite different from the one used in a Voronoi or Dirichlet tessellation
of the input space, as done in the SOM algorithm. By observing that
$r_k$ is the dependent variable, each quantization region can be defined with
respect to the projection direction $\mathbf{a}_k$ as the area demarcated by the
$(\mathbf{a}_k \mathbf{x}^T)$-axis (horizontal) coordinates of two successive neurons of the lattice (thin
dashed, vertical lines in Fig. 1A): e.g. quantization region $H_c$ is bounded by
the horizontal weight coordinates of neurons $i$ and $j$. We associate with each
quantization region a code membership function:


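$$\mathbb{1}_{H_j}(\mathbf{a}_k \mathbf{x}^T) = \begin{cases} [\ldots] & \text{if } \mathbf{a}_k \mathbf{x}^T \in H_j, \\ 0 & \text{if } \mathbf{a}_k \mathbf{x}^T \notin H_j, \end{cases} \tag{4}$$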


with $n_{H_j}$, $n_{H_j} \in \{1, 2\}$, the number of neurons that bound $H_j$, and $(\mathbf{a}_k \mathbf{x}^T)$
the projected input. (Note that since the quantization regions are bounded
by the horizontal weight coordinates, their code membership functions only
depend on the projected inputs.) The Maximum Entropy learning Rule
(MER) is defined as:

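$$\Delta \mathbf{w}_i = \eta \sum_{H_j \in S_i} \mathbb{1}_{H_j}(\mathbf{a}_k \mathbf{x}^T)\, \mathrm{Sgn}(\mathbf{v} - \mathbf{w}_i), \quad \forall i \in A, \tag{5}$$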


with $\eta$ the learning rate (a positive constant), $\mathbf{w}_i = [w_{i1}, w_{i2}]$ the weight
vector of neuron $i$, $\mathbf{v} = [(\mathbf{a}_k \mathbf{x}^T), r_k] \in V$ the current input vector, $S_i$ the
two quantization regions that have neuron $i$ as a common bound, and Sgn(.)
the sign function taken componentwise. The effect of a single MER update
is shown in Fig. 1A (thick dashed line). It can be formally proven that for
a one-dimensional lattice, developed in one-dimensional space, MER yields
an equiprobable quantization at convergence for any $N$ [9], and that in the
multi-dimensional case, MER yields a quantization which will approximate
an equiprobable one at convergence for large $N$ [10]. This implies that, e.g.
for neuron $j$ in Fig. 1B, there will be an equal number of data points below
and above $j$ (i.e. in the light and dark shaded regions). Since this is the case
for every neuron of the lattice, there will be, roughly speaking, an equal number
of data points from the set $\{(\mathbf{a}_k \mathbf{x}^{mT}, r_k^m)\}$ above and below the piecewise
smooth activation function $f_k$ (thick full line) at convergence. Furthermore,
the neuron weights will represent the medians of the corresponding quantization
regions: each weight vector will converge to the median, with the
“median” defined as the vector of the (scalar) medians in each input dimension
separately. (Note that there exists no unique definition of median for the
higher-than-one dimensional case.) Finally, since a one-dimensional lattice
is guaranteed to converge to an unfolded one in one-dimensional space [10],
the lattice of Fig. 1A is guaranteed to converge to one which will always be
unfolded with respect to the horizontal axis and, hence, we are assured to
obtain a functional mapping.
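
A single MER update for a one-dimensional lattice can be sketched as below, under several stated assumptions: the neurons are kept sorted along the horizontal axis, the two outer quantization regions are unbounded and bounded by one neuron each, and the code membership function takes the value `scale / n_H` inside the active region. The constant `scale` and the function names are hypothetical; the sketch is meant to illustrate the structure of eq. (5), not to reproduce the author's implementation.

```python
def sign(value):
    """Sign function returning -1, 0 or 1."""
    return (value > 0) - (value < 0)

def mer_update(weights, v, eta=0.02, scale=1.0):
    """Apply one MER step to a 1-D lattice developed in the 2-D space V.

    weights: list of [w_i1, w_i2], sorted by horizontal coordinate w_i1.
    v:       current input [projected input, residual].
    Assumption: the code membership function equals scale / n_H inside the
    active region, with n_H the number of neurons bounding that region.
    """
    # Locate the active quantization region: it lies between neurons
    # (right - 1) and right; an index outside the lattice means that the
    # region is unbounded on that side.
    right = sum(1 for w in weights if w[0] <= v[0])
    bounding = [i for i in (right - 1, right) if 0 <= i < len(weights)]
    membership = scale / len(bounding)
    updated = [list(w) for w in weights]
    for i in bounding:
        for dim in range(2):
            updated[i][dim] += eta * membership * sign(v[dim] - weights[i][dim])
    return updated

lattice = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
after = mer_update(lattice, [0.5, 1.0], eta=0.1)
```

Because only the sign of the difference enters the update, each weight component takes fixed-size steps towards the inputs of its receptive field, which is why the weights settle at medians rather than means.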

Finally, the efficiency with which the input statistics are modeled can be
improved by considering several regression models in parallel and treating them
as classes. Each model can be trained with the aforementioned MER/PPR
combination  on  common  or  on  separate  data  sets.  In  case  of  the  former, 
a  class  membership  definition  is  needed.  After  training,  data  compression 
can  be  performed  using  the  weights  and  projection  directions  of  the  trained 
regression  models;  model  selection  occurs  according  to  the  class  membership 
function  definition  adopted  for  this  application. 


IMAGE  COMPRESSION 

Training  of  regression  models 

Consider a grey scale image sized M x M pixels in which we select an
m x m region or block (Fig. 2). The selected region is divided in two parts:
1) the central pixel at row i and column j, and 2) the surrounding pixels
(shaded area), termed surround(i, j). Consider now the regression fitting of
the grey level of the central pixel as a function of the vector of grey levels of
the surround: the pixels in the surround define an (m x m - 1)-dimensional
vector of independent variables and the central pixel represents the possibly
noisy measurement of the unknown function which is to be estimated.

In  order  to  better  capture  the  local  image  statistics,  we  consider  a  total  of 
L  classes  in  parallel.  Each  class  is  represented  by  a  regression  model  eq.  (2) 
but  with  different  projection  directions  and  weight  vectors.  We  divide  the 
original  image  into  L  subimages  sized  M  x  M  pixels.  We  assume  a  toroidal 
Figure 1: (A) Definition of quantization region in a one-dimensional lattice.
The weights of the lattice are joined by line segments so as to yield the activation
function $f_k$. The thick full and thick dashed lines represent $f_k$ before
and after the weights are updated once using MER, respectively. The thin
dashed, vertical lines represent the borders of the quantization regions prior
to updating the weights. The shaded region corresponds to the receptive
field of neuron $j$ and it comprises the quantization regions $H_c$ and $H_d$. The
present input is indicated by the black dot in $H_c$. (B) At convergence, there
will be an equal number of data points in the dark and light shaded regions,
below and above neuron $j$, respectively. The thick full line represents the
piecewise linear activation function $f_k$.



Figure 2: Definition of central pixel at coordinate (i, j) and the vector of
surrounding pixels (shaded area), termed surround(i, j).
extension for each subimage and a lateral shift of one pixel between two
successive image blocks so that, in this way, we have one measurement per
subimage pixel at our disposal in an m x m dimensional data space. Each regression model
is trained on a different subimage until convergence.¹ The latter greatly
reduces the computational complexity of the algorithm and the overall
training time, and it alleviates the problem of how to properly initialize the
regression models (see e.g. [7]).
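
The construction of the training set can be sketched as follows: every pixel of a subimage serves once as central pixel, and the block around it wraps around the borders (toroidal extension). The sketch assumes an odd block size m so that the central pixel is well defined; the function name `training_samples` and the toy image are illustrative assumptions.

```python
def training_samples(image, m):
    """Collect (surround, centre) pairs for every m x m block of `image`.

    image: list of rows holding grey levels.
    Blocks are shifted by one pixel and wrap around the borders (toroidal
    extension), so each pixel is the central pixel of exactly one block.
    """
    rows, cols = len(image), len(image[0])
    half = m // 2
    samples = []
    for i in range(rows):
        for j in range(cols):
            surround = [
                image[(i + di) % rows][(j + dj) % cols]
                for di in range(-half, half + 1)
                for dj in range(-half, half + 1)
                if (di, dj) != (0, 0)
            ]
            samples.append((surround, image[i][j]))
    return samples

image = [[4 * r + c for c in range(4)] for r in range(4)]  # toy 4 x 4 image
samples = training_samples(image, m=3)
```

Each sample pairs an (m x m - 1)-dimensional surround vector with the grey level of its central pixel, which is the form of data the regression models are fitted to.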

Definition of codebook vector and class membership

Before we can perform image compression, we have to decide on the definitions
of codebook vector and class membership. The latter will be similar
but not equal to the one used in subspace classifiers. Assume that we use
L classes of which the corresponding regression models consist of N neurons
and K projection directions, for simplicity’s sake. Each regression model is
trained on blocks sized m x m pixels as explained in the previous paragraph.
We now partition the image into nonoverlapping but juxtaposed blocks sized
m x m pixels and project the (m x m - 1)-dimensional surround vector of each
block (cf. Fig. 2) along the K projection directions of each class. The nearest
neuron weight vectors (in Euclidean sense), along each of the K projection
directions, are then used for discretizing the projected surround vector and
the binary indices of these weight vectors for storing the discretized surround
vector; the central pixel value need not be discretized (and stored) since it can
be predicted from the regression model of the selected class. The discretized
surround vector can be computed from its K discretized projection coordinates.
Hence, for each class, we store the N x K weight vectors from which
the codebook vectors needed for decompression can be determined.

Finally, for cases in which L > 1, we use the following definition of class
membership: for each image block, we use the class which produces the
smallest sum of the following two quantities: 1) the quantization error between
the projected and the original surround vector, and 2) the regression
error between the predicted and the original central pixel value. The binary
code of the selected class is then stored together with the binary code of
the K discretized projection coordinates. Decompression is achieved when
the binary codes are replaced by the corresponding codebook vectors; the
central pixel value is obtained from the corresponding regression model. The
image obtained in this way is termed the decompressed image. The quality
of the decompressed image can be improved when, for L = 1, the regression
model is applied to the decompressed image but now with overlapping image
blocks. For L > 1, we can take a conservative approach by selecting the
regression model which yields the smallest discrepancy between the central
pixel value it produces and the one we had before in the decompressed image.
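
The class membership rule for L > 1 amounts to a minimum search over the classes, as in the following sketch. The error values are hypothetical placeholders; how the two errors are measured (for instance as squared differences) is an assumption left open here.

```python
def select_class(quantization_errors, regression_errors):
    """Index of the class with the smallest sum of the two error terms."""
    totals = [q + r for q, r in zip(quantization_errors, regression_errors)]
    return totals.index(min(totals))

# Hypothetical per-class errors for one image block (L = 3 classes).
quantization_errors = [4.0, 1.5, 2.0]
regression_errors = [0.5, 2.5, 1.0]
chosen = select_class(quantization_errors, regression_errors)
```

Note that the selected class need not be the best on either criterion alone: in this example class 1 has the smallest quantization error and class 0 the smallest regression error, yet class 2 gives the smallest sum.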

¹ Since we do not treat the subimages as preclassified data, we also considered the
alternative case where a single training set is used and where training of each one
of the L regression models continues on those training samples for which it yields
the smallest regression errors until all samples are consistently represented. However,
since this recursive approach yielded only a minor improvement for the case
reported in the Simulation Results section, but at the expense of a much increased
computational complexity and training time, we did not consider it further.
SIMULATION RESULTS


In the simulations, we consider 8 bit images and surrounds sized 5 x 5 - 1 pixels
(i.e. m = 5). We use lattices of N neurons, with cubic spline interpolation
between the neuron weights, K projection directions and L regression models
in such a way that (N + 1) x K x L = 64 (we use (N + 1) to account for the
space needed for storing the codebook vectors). The compression ratio can
be estimated as follows: for the original image we need 8 bits per pixel (bpp)
and for the compressed image we need (N + 1) x K x L values or 6 bits to
code for the vector of grey values of each m x m block in which the image is
partitioned, i.e. 0.24 bpp.
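
The bit-rate estimate above can be checked with a few lines of arithmetic. The sketch below evaluates (N + 1) x K x L and the resulting bits per pixel for the four fully specified MER/PPR configurations of Table 1; the function name `bits_per_pixel` is an illustrative choice.

```python
import math

def bits_per_pixel(n_neurons, n_projections, n_classes, block_size):
    """Bit rate implied by (N + 1) x K x L code values per m x m block."""
    n_values = (n_neurons + 1) * n_projections * n_classes
    bits_per_block = math.log2(n_values)
    return n_values, bits_per_block / block_size ** 2

# The four fully specified MER/PPR configurations L/N/K of Table 1, with m = 5.
for L, N, K in [(1, 7, 8), (2, 7, 4), (4, 7, 2), (1, 15, 4)]:
    n_values, bpp = bits_per_pixel(N, K, L, block_size=5)
    print(f"L/N/K = {L}/{N}/{K}: {n_values} values, {bpp:.2f} bpp")
```

All four configurations give 64 values, hence 6 bits per 25-pixel block and 0.24 bpp, which corresponds to a compression ratio of about 33 relative to the 8 bpp original.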

We run MER in batch mode ($\eta = 0.02$) and determine, after each epoch,
the  MSE  between  the  actual  and  the  desired,  equiprobable  code  membership 
function  usage.  We  run  MER  until  the  magnitude  of  the  difference  between 
the  present  and  the  previous  running-averaged  MSE  is  lower  than  1.0  10“^  or 
until  15,000  epochs  have  elapsed;  the  present  running  average  equals  10%  of 
the  present,  unaveraged  MSE  added  to  90%  of  the  previous  running  average. 
In order to optimize $C(\mathbf{a}_k)$, the procedure is run for the $\mathbf{a}_k$ taken as unit
vectors; the components of the unit vector with the lowest residual error are
then further optimized by performing hill descent on $C(\mathbf{a}_k)$ in steps of 0.01;
after each update, $\mathbf{a}_k$ is renormalized to unit length. Finally, in order to
further optimize the K projection directions obtained, we apply backfitting:
we cyclically minimize $C(\mathbf{a}_k)$ for the residuals of projection direction $k$, until
there is little or no change (< 0.1%) or until 10 full cycles have elapsed. We
also ensure that, after each update, $\mathbf{a}_k$ is renormalized.
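
The stopping rule for MER described above (a running average made of 10% of the present MSE and 90% of the previous average, with a cap of 15,000 epochs) can be sketched as follows. The tolerance `tol` is left as a free parameter, the per-epoch MSE is abstracted as a callable, and the initialization of the running average with the first MSE value is an assumption of this sketch.

```python
def run_until_converged(epoch_mse, tol, max_epochs=15000, alpha=0.1):
    """Iterate `epoch_mse(epoch)` until the running-averaged MSE settles.

    The running average is alpha * current MSE + (1 - alpha) * previous average.
    Returns the number of epochs that were run and the final running average.
    """
    average = epoch_mse(0)  # assumption: start the average at the first MSE
    for epoch in range(1, max_epochs):
        previous = average
        average = alpha * epoch_mse(epoch) + (1 - alpha) * previous
        if abs(average - previous) < tol:
            return epoch + 1, average
    return max_epochs, average

# Hypothetical MSE curve that decays geometrically towards 0.1.
epochs, final = run_until_converged(lambda e: 0.1 + 0.5 ** e, tol=1e-6)
```

The averaging smooths out epoch-to-epoch fluctuations, so the test on successive averages reacts to the trend of the MSE rather than to a single lucky epoch.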


As  an  example,  we  consider  the  LENA  image  (Fig.  3A),  sized  512  x  512 
pixels  at  8  bits  per  pixel.  We  modify  the  grey  scale  of  the  image  from  [0, 255] 
to  [0, 10]  and  select  L  subimages  sized  128  x  128  pixels  (Fig.  3B)  and  use  the 
16,384  data  vectors  corresponding  to  each  subimage  for  training  the  lattices 
as  explained  in  the  previous  section.  For  the  time  being,  we  have  considered 
only  L  =  1,  2  and  4.  When  L  =  1  only  subimage  1  is  used,  when  L  =  2 
subimages  1  and  2  are  used,  and  so  on.  In  order  to  quantify  the  simulation 
results,  we  compute  the  MSE  (MSEc)  and  the  signal-to-noise-ratio  due  to 
quantization  between  the  original  and  the  decompressed  image: 


$$\mathrm{SNR}_c = 10 \log_{10} \left( [\ldots] \right) \ \text{(dB)}, \tag{6}$$


with $\hat{I}$ the decompressed image. The results for the LENA image are summarized
in Table 1; the second column lists the actual configuration used,
abbreviated symbolically as L/N/K; the last two columns list the corresponding
MSEc and SNRc values. We observe that the performance of
MER/PPR improves when N is lowered from 15 to 7 so that L and K can
increase. The decompressed image for L/N/K = 2/7/4 is shown in Fig. 3C.
Using the same standard image, the results obtained with the OIAL and KLT
algorithms are also shown in Table 1 [16]. The OIAL algorithm uses K = 4
projections and L = 128 classes and the KLT algorithm uses K = 4 projections
only, both for 0.25 bpp but for blocks sized 8 x 8 pixels. We observe
that MER/PPR performs reasonably well for N = 7; however, we hasten to
Figure 3: (A) Original LENA image, (B) subimages used for training, and
(C) decompressed image obtained with MER/PPR (L/N/K = 2/7/4).


Table 1: Performances obtained with the MER/PPR, OIAL and KLT algorithms

algorithm   L/N/K     MSEc   SNRc (dB)
MER/PPR     1/7/8     82.6   23.0
            2/7/4     56.7   24.6
            4/7/2     69.8   23.8
            1/15/4    89.7   22.7
                      91.1   22.6
OIAL        128/-/4   54.9   24.8
KLT         1/-/4     71.0   23.7

add that many more images should be considered before any judgment can
be made.


CONCLUSION 


In  this  paper,  we  have  introduced  a  new  approach  to  lossy  image  compression 
using  topographic  maps  and  a  technique  similar  to  subspace  classification. 
Within each class, a number of topographic maps were trained so that a nonparametric
regression model of the local image statistics is obtained. The
topographic maps were trained using our recently introduced rule, called the
Maximum Entropy learning Rule (MER) [9, 10], in combination with projection
pursuit regression (PPR) learning [11]. The use of PPR in combination
with MER offers the advantage that we don’t need a prohibitive number of
neurons for regression modeling in high dimensional spaces and a high number
of input samples to allocate the neuron weights reliably (cf. the curse of
dimensionality). Furthermore, and in particular with respect to small data
sets,  since  with  MER  the  neuron  weights  converge  to  the  medians  of  their 
quantization  regions,  the  regression  models  obtained  will  be  less  sensitive  to 
outliers  but,  on  the  other  hand,  they  will  be  more  sensitive  to  biased  noise. 
Fortunately,  the  effect  of  the  latter  can  be  reduced  by  backfitting.  Finally, 
since  we  have  essentially  trained  our  topographic  maps  for  nonparametric 
regression  purposes,  we  could  equally  well  have  considered  noise  cancelling 
as  an  application.  This  will  be  explored  in  our  future  work. 
Abstract

We discuss an unsupervised learning method which is driven by an information
theoretic based criterion. Information theoretic based learning has been
examined by several authors: Linsker [2, 3], Bell and Sejnowski [5], Deco and
Obradovic [1], and Viola et al. [6]. The method we discuss differs from previous
work  in  that  it  is  extensible  to  a  feed-forward  multi-layer  perceptron  with  an 
arbitrary  number  of  layers  and  makes  no  assumption  about  the  underlying 
PDF  of  the  input  space.  We  show  a  simple  unsupervised  method  by  which 
multi-dimensional  signals  can  be  nonlinearly  transformed  onto  a  maximum 
entropy  feature  space  resulting  in  statistically  independent  features. 


1.0  INTRODUCTION 

Our goal is to develop mappings that yield statistically independent features. We
present here a nonlinear adaptive method of feature extraction. It is based on concepts
from information theory, namely mutual information and maximum cross
entropy. The adaptation is unsupervised in the sense that the mapping is determined
without  assigning  an  explicit  target  output,  a  priori,  to  each  exemplar.  It  is  driven, 
instead,  by  a  global  property  of  the  output:  cross  entropy. 

There  are  many  mappings  by  which  statistically  independent  outputs  can  be 
obtained.  At  issue  is  the  usefulness  of  the  derived  features.  Towards  this  goal  we 
apply  Linsker’s  Principle  of  Information  Maximization  which  seeks  to  transfer 
maximum  information  about  the  input  signal  to  the  output  features.  It  is  also  shown 
that  the  resulting  adaptation  rule  fits  naturally  into  the  back-propagation  method  for 
training  multi-layer  perceptrons. 

Previous methods [1] have optimized entropy at the output of the mapping by considering
the underlying distribution at the input. This represents a complex problem
for general nonlinear mappings. The method presented here, by contrast, is more
directly related to the technique of Bell and Sejnowski [5] in which we manipulate
entropy through observation at the output of the mapping. Specifically, we exploit a
property of entropy coupled with a saturating nonlinearity which results in a
method for entropy manipulation that is extensible to feed-forward multi-layer perceptrons
(MLP). The technique can be used for an MLP with an arbitrary number
of  hidden  layers.  As  mutual  information  is  a  function  of  two  entropy  terms,  the 
method  can  be  applied  to  the  manipulation  of  mutual  information  as  well. 

In  section  2  we  discuss  the  concepts  upon  which  our  feature  extraction  method  is 
based.  We  derive  the  adaptation  method  which  results  in  statistically  independent 
features in section 3. An example result is presented in section 4, while our conclusions
and observations appear in section 5.

2.0  BACKGROUND 

The  method  we  describe  here  combines  cross  entropy  maximization  with  Parzen 
window  probability  density  function  estimation.  These  concepts  are  reviewed. 

2.1  Maximum  Entropy  as  a  Self-organizing  Principle 

Maximum entropy techniques have been applied to a host of problems (e.g. blind
separation, parameter estimation, coding theory). Linsker [2] proposed maximum
entropy as a self-organizing principle for neural systems. The basic premise is
that any mapping of a signal through a neural network should be accomplished so
as to maximize the amount of information preserved. Linsker demonstrates this
principle of maximum information preservation for several problems, including a
deterministic signal corrupted by Gaussian noise. Mathematically, Linsker's
principle is stated

$$ I(x, y) = h_Y(y) - h_{Y|X}(y \mid x) \qquad (1) $$

where $I(x, y)$ is the mutual information of the RVs $X$ and $Y$, and
$h(\cdot)$ is the continuous entropy measure [4]. Given the RV (random vector),
$Y \in \Re^N$, the continuous entropy is defined as

$$ h_Y(u) = -\int_{-\infty}^{\infty} \log(f_Y(u))\, f_Y(u)\, du, \qquad (2) $$

where $f_Y(u)$ is the probability density function of the RV, the base of the
logarithm is arbitrary, and the integral is $N$-fold. Several properties of the
continuous entropy measure are of interest.

1. If the RV is restricted to a finite range in $\Re^N$, the continuous entropy
measure is maximized for the uniform distribution.

2. If the covariance matrix is held constant, the measure is maximized for the
normal distribution.

3. If the RV is transformed by a mapping $g: \Re^N \to \Re^N$, then the entropy
of the new RV, $y = g(x)$, satisfies the inequality

$$ h_Y(y) \le h_X(x) + E\{\ln(|J_{XY}|)\} \qquad (3) $$

with equality if and only if the mapping has a unique inverse, where $J_{XY}$ is
the Jacobian of the mapping from $X$ to $Y$.

Regarding  the  first  two  properties  we  note  that  for  either  case  each  element  of  the 
RV  is  statistically  independent  from  the  other  elements. 

Examination of (3) implies that by transforming an RV we can increase the amount
of information. This is a consequence of working with continuous RVs. In general
the continuous entropy measure is used to compare the relative entropies of
several RVs. We can see from (3) that if two RVs are mapped by the same
invertible linear transformation, their relative entropies (as measured by the
difference) remain unchanged. However, if the mapping is nonlinear, in which
case the second term of (3) is a function of the random variable, it is possible
to change the relative information of two random variables. From the perspective
of classification this is an important point. If the mapping is topological (in
which case it has a unique inverse), there is no increase, theoretically, in the
ability to separate classes. That is, we can always reflect a discriminant
function in the transformed space as a warping of another discriminant function
in the original space. However, finding the discriminant function is a different
problem altogether. By changing the relative information, the form of the
discriminant function may be simpler.
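
The linear case mentioned above can be written out as a short worked equation. This is only a sketch: it assumes property 3 holds with equality for an invertible map, in the standard form $h_Y = h_X + E\{\ln|J|\}$, and the matrix $A$ and the RVs $x_1$, $x_2$ are illustrative symbols that do not appear in the original text.

```latex
% Assumption: for an invertible map, h_Y(y) = h_X(x) + E{ ln |J| }.
% For y = A x with A an invertible N x N matrix the Jacobian is A itself,
% so the expectation reduces to the constant ln|det A|.
\begin{align*}
h(y_1) &= h(x_1) + \ln\lvert\det A\rvert, \\
h(y_2) &= h(x_2) + \ln\lvert\det A\rvert, \\
h(y_1) - h(y_2) &= h(x_1) - h(x_2).
\end{align*}
```

The constant cancels in the difference, which is why a shared invertible linear map leaves relative entropies unchanged, whereas a nonlinear map, whose Jacobian depends on the RV, need not.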

This is not true, however, for a mapping onto a subspace. Our implicit
assumption here is that we are unable to reliably determine a discriminant
function in the full input space. As a consequence we seek a subspace mapping
that is in some measure optimal for classification. We cannot avoid the loss of
information (and hence some ability to discriminate classes) when using a
subspace mapping. However, if the criterion used for adapting the mapping is
entropy based, we can perhaps minimize this loss. It should be mentioned that in
all classification problems there is an implicit assumption that the classes to
be discriminated do indeed lie in a subspace.

2.2  Nonparametric  Pdf  Estimation 

One difficulty in applying the continuous entropy measure with continuous RVs is
that it requires some knowledge of the underlying PDF (probability density
function). Unless assumptions are made about the form of the density function it
is very difficult to use the measure directly. A nonparametric kernel-based
method for estimating the PDF is the Parzen window method [7]. The Parzen window
estimate of the probability density, $\hat{f}_Y(u)$, of a random vector
$Y \in \Re^N$ at a point $u$ is defined as


$$ \hat{f}_Y(u) = \frac{1}{N_y} \sum_{i=1}^{N_y} \kappa(y_i - u). \qquad (4) $$

The vectors $y_i$ are observations of the random vector and $\kappa(\cdot)$ is a
kernel function which itself satisfies the properties of PDFs (i.e.
$\kappa(u) > 0$ and $\int \kappa(u)\, du = 1$). Since we wish to make a local
estimate of the PDF, the kernel function should also be localized (i.e.
uni-modal, decaying to zero). In the method we describe we will also require
that $\kappa(\cdot)$ be differentiable everywhere. In the multidimensional case
the form of the kernel is typically Gaussian or uniform. As a result of the
differentiability requirement, the Gaussian kernel is most suitable here. The
computational complexity of the estimator increases with dimension; however, we
will be estimating the PDF in the output space of our multi-layer perceptron,
where the dimensionality can be controlled.
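
The Parzen window estimate described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming an isotropic Gaussian kernel with a single width `sigma`; the function names, the array layout and the uniformly drawn stand-in samples are choices made for this example and are not taken from the paper.

```python
import numpy as np


def gaussian_kernel(d, sigma):
    """Isotropic Gaussian kernel at difference vectors d of shape (..., M)."""
    m = d.shape[-1]
    norm = (2.0 * np.pi * sigma**2) ** (-m / 2.0)
    return norm * np.exp(-np.sum(d**2, axis=-1) / (2.0 * sigma**2))


def parzen_estimate(u, y, sigma):
    """Parzen window estimate at points u (G, M) from samples y (N, M)."""
    d = y[None, :, :] - u[:, None, :]          # (G, N, M): y_i - u_j
    return gaussian_kernel(d, sigma).mean(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.uniform(0.0, 1.0, size=(500, 2))   # stand-in for MLP outputs
    u = np.array([[0.5, 0.5], [3.0, 3.0]])     # one point inside, one far outside
    print(parzen_estimate(u, y, sigma=0.1))
```

The cost is one kernel evaluation per (sample, evaluation point) pair, which is why keeping the output dimension small matters in practice.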


3.0  DERIVATION  OF  LEARNING  ALGORITHM 

As we stated, our goal is to find statistically independent features; features
that jointly possess minimum mutual information or maximum cross entropy.

Suppose we have a mapping $g: \Re^N \to \Re^M$, $M < N$, of a random vector
$X \in \Re^N$, which is described by the following equation

$$ Y = g(\alpha, X) \qquad (5) $$

How do we adapt the parameters $\alpha$ such that the mapping results in a
maximum cross-entropy random variable? If we have a desired target distribution
then we can use the Parzen windows estimate to minimize the “distance” between
the observed distribution and the desired distribution. If the mapping has a
restricted range (as does the output of an MLP using sigmoidal nonlinearities),
the uniform distribution (which has maximum entropy for restricted range) can be
used as the target distribution. If we adapt the parameters, $\alpha$, of our
mapping such that the output distribution is uniform, then we will have achieved
statistically independent features regardless of the underlying input
distribution.

Viola et al. [6] have taken a very similar approach to entropy manipulation,
although that work differs in that it does not address nonlinear mappings
directly, the gradient is estimated stochastically, and entropy is worked with
explicitly. By our choice of topology (MLP) and distance metric we are able to
work with entropy indirectly and fit the approach naturally into a
back-propagation learning paradigm.

As our minimization criterion we use the integrated squared error between our
estimate and the desired distribution, which we approximate with a summation.

$$ J = \frac{1}{2} \int_{\Omega_Y} \left( f_Y(u) - \hat{f}_Y(u, y) \right)^2 du,
   \qquad y = \{y_1 \ldots y_{N_y}\} \qquad (6) $$

In (6), $\Omega_Y$ indicates the nonzero region (a hypercube for the uniform
distribution) over which the $M$-fold integration is evaluated. The criterion
above exploits the fact that the MLP with saturating nonlinearities has finite
support at the output. This fact coupled with property 1 (i.e. as the integrated
squared error between the observed output distribution and the uniform
distribution is minimized, entropy is maximized) makes the criterion suitable
for entropy manipulation.

Assuming the output distribution is sampled adequately, we can approximate this
integral with a summation in which $u_j \in \Re^M$ are samples in $M$-space and
$\Delta u$ represents a volume.

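The gradient of the criterion function with respect to the mapping parameters is
determined via the chain rule as

$$ \frac{\partial J}{\partial \alpha}
   = \frac{\partial J}{\partial \hat{f}_Y}
     \frac{\partial \hat{f}_Y}{\partial g}
     \frac{\partial g}{\partial \alpha}
   = -\sum_j \varepsilon_Y(u_j, y)\, \Delta u\,
     \frac{\partial \hat{f}_Y(u_j, y)}{\partial g}
     \frac{\partial g}{\partial \alpha} \qquad (7) $$

where $\varepsilon_Y(u_j, y) = f_Y(u_j) - \hat{f}_Y(u_j, y)$ is the computed
distribution error over all observations $y$. The last term in (7),
$\partial g / \partial \alpha$, is recognized as the sensitivity of our mapping
to the parameters $\alpha$. Since our mapping is a feed-forward MLP ($\alpha$
represents the weights and bias terms of the neural network), this term can be
computed using standard backpropagation. The remaining partial derivative,
$\partial \hat{f}_Y / \partial g$, is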


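$$ \frac{\partial \hat{f}_Y(u_j, y)}{\partial g}\bigg|_{y_i}
   = \frac{1}{N_y} \frac{\partial \kappa(y_i - u_j)}{\partial y_i}. \qquad (8) $$

Substituting (8) into (7) yields

$$ \frac{\partial J}{\partial \alpha}
   = -\frac{\Delta u}{N_y} \sum_j \sum_{i=1}^{N_y} \varepsilon_Y(u_j, y)
     \left( \frac{\partial \kappa(y_i - u_j)}{\partial y_i} \right)^{T}
     \frac{\partial y_i}{\partial \alpha}. \qquad (9) $$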


The terms in (9), excluding the mapping sensitivities, become the new error term
in our backpropagation algorithm. This adaptation scheme is depicted in
figure 1, which shows that it fits neatly into the backpropagation paradigm.
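
The error term that is fed into backpropagation can be checked numerically. The sketch below is one possible reading of the scheme, not the authors' implementation: it assumes the criterion is discretized as `0.5 * sum_j (target - f_hat(u_j))**2 * du` on a regular grid over the unit square, a uniform target density of 1, and an isotropic Gaussian kernel. The function names and grid layout are hypothetical.

```python
import numpy as np


def kernel(d, sigma):
    """Isotropic Gaussian kernel for difference vectors d of shape (..., M)."""
    m = d.shape[-1]
    norm = (2.0 * np.pi * sigma**2) ** (-m / 2.0)
    return norm * np.exp(-np.sum(d**2, axis=-1) / (2.0 * sigma**2))


def criterion_and_error_term(y, u, du, sigma, target=1.0):
    """Return J and dJ/dy for outputs y (N, M) and grid points u (G, M).

    J is 0.5 * sum_j (target - f_hat(u_j))^2 * du, an assumed discretization
    of the integrated squared error against a uniform target density.
    """
    n = y.shape[0]
    d = y[None, :, :] - u[:, None, :]              # (G, N, M): y_i - u_j
    k = kernel(d, sigma)                           # (G, N) kernel values
    eps = target - k.mean(axis=1)                  # distribution error at each u_j
    J = 0.5 * np.sum(eps**2) * du
    dk_dy = k[..., None] * (-d) / sigma**2         # kappa * (u_j - y_i) / sigma^2
    dJ_dy = -(du / n) * np.einsum("g,gnm->nm", eps, dk_dy)
    return J, dJ_dy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.uniform(0.2, 0.8, size=(50, 2))        # stand-in for MLP outputs
    g = (np.arange(10) + 0.5) / 10.0               # 10 x 10 grid of cell centres
    u = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
    J, dJ_dy = criterion_and_error_term(y, u, du=0.01, sigma=0.1)
    # A small descent step applied directly to the outputs should lower J.
    J_new, _ = criterion_and_error_term(y - 0.01 * dJ_dy, u, du=0.01, sigma=0.1)
    print(J, J_new)
```

In a full implementation `dJ_dy` would be backpropagated through the network in place of the usual output error; here the outputs are moved directly so the sketch stays self-contained.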

Examination of the Gaussian kernel and its differential in two dimensions
illustrates some of the practical issues of implementing this method of feature
extraction, as well as providing an intuitive understanding of what is happening
during the adaptation process. The N-dimensional Gaussian kernel evaluated at
some $u$ is (simplified for two dimensions)

$$ \kappa(y_i - u)
   = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}}
     \exp\left( -\frac{1}{2} (y_i - u)^T \Sigma^{-1} (y_i - u) \right)
   = \frac{1}{2\pi\sigma^2}
     \exp\left( -\frac{1}{2\sigma^2} (y_i - u)^T (y_i - u) \right);
   \quad \Sigma = \begin{bmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{bmatrix},
   \ N = 2 \qquad (10) $$


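The partial derivative of the kernel (also simplified for the two-dimensional
case) with respect to the input $y_i$ as observed at the output of the MLP is

$$ \frac{\partial \kappa}{\partial y_i}
   = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}}
     \exp\left( -\frac{1}{2} (y_i - u)^T \Sigma^{-1} (y_i - u) \right)
     \Sigma^{-1} (u - y_i)
   = \kappa(y_i - u)\, \Sigma^{-1} (u - y_i)
   = \frac{1}{2\pi\sigma^4}
     \exp\left( -\frac{1}{2\sigma^2} (y_i - u)^T (y_i - u) \right) (u - y_i);
   \quad \Sigma = \begin{bmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{bmatrix},
   \ N = 2 \qquad (11) $$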


These functions are shown in figure 2. The contour of the Gaussian kernel is
useful in that it shows that output samples, $y_i$, greater than two standard
deviations from the center, $u$, of the kernel (in the feature space) do not
significantly impact the estimate of the output PDF at that sample point.
Likewise, the gradient term is not significant for output samples exceeding two
standard deviations from the kernel center. Consequently, sample points for the
PDF estimate should not exceed a distance of two standard deviations from each
other; otherwise, samples caught “in between” do not contribute significantly to
the estimate of the PDF. A large number of such samples can cause very slow
adaptation.

Recall that the terms in (9) replace the standard error term in the
backpropagation algorithm. This term, excluding the PDF error, is plotted as a
surface in figure 2. From this plot we see that the kernels act as either local
attractors or repellors depending on whether the computed PDF error is negative
(repellor) or positive (attractor). In this way the adaptation procedure
operates in the feature space locally from a globally derived measure of the
output space (PDF estimate).


4.0  EXPERIMENTAL  RESULTS 

We have conducted experiments using this method on millimeter-wave ISAR (inverse
synthetic aperture radar) images (64 x 64 pixels). The mapping structure we use
in our experiment is a multi-layer perceptron with a single hidden layer (4096
input nodes, 4 hidden nodes, 2 output nodes). Using the adaptation method
described, we trained the network on two vehicle types with ISAR images from 180
degrees of aspect. The projection of the training images (and between-aspect
testing images) is shown in figure 3 (where adjacent aspect training images are
connected). As can be seen, significant class separation is exhibited (without
prior labeling of the classes). We also note that the points where the classes
overlap correspond to the cardinal aspect angles, which are, in general,
difficult aspect angles to separate on similar vehicles in this type of imagery.

5.0  CONCLUSIONS 

We have presented what we believe to be a new method of unsupervised learning.
This method, unlike previous methods, is limited neither to linear topologies
[3] nor to unimodal PDFs [5]. In effect, we achieve features which are
statistically independent from each other and yet are still, clearly,
structurally related to the input structure as exhibited by the results of our
example. This property bears similarity to Kohonen's discrete SOFM; however, our
map exists in a continuous output space. We are pursuing in our research a more
rigorous analysis in the comparison of the resulting feature maps to the Kohonen
type. We are utilizing this method as a preprocessing step for classification in
our continuing research, although other applications certainly exist (e.g. blind
separation).

Figure 2  The plots above assume that we are using a two-dimensional Gaussian
kernel with a diagonal covariance matrix with $\sigma^2$ on the diagonals.
Contour of the Gaussian kernel (top left, normalized by $\sigma$), surface plots
of the gradient terms with respect to $y_{i1}$ (top right), $y_{i2}$ (bottom
left), and magnitude (bottom right), all normalized by $\sigma$. These terms are
essentially zero at a distance of two standard deviations.


Figure 3  Example of training on ISAR images of two vehicles (aspect varying
over 180 degrees). Over most of the aspect angles the vehicles are separated in
the new feature space. Adjacent aspect angles are connected in the training set,
evidence that topological neighborhoods were maintained. The mapping also
generalizes to the testing set.

Abstract. In this paper we present a regularization scheme which iteratively
adapts the regularization parameters by minimizing the validation error. It is
suggested to use the adaptive regularization scheme in conjunction with Optimal
Brain Damage pruning to optimize the architecture and to avoid overfitting.
Furthermore, we propose an improved neural classification architecture
eliminating an inherent redundancy in the widely used SoftMax classification
network. Numerical results demonstrate the viability of the method.

INTRODUCTION 

Neural networks are flexible tools for pattern recognition and by expanding the
network architecture any relevant target function can be approximated [6]. In
this contribution we present an improved version of the neural classifier
architecture based on a feed-forward net with SoftMax [2] normalization
presented in [7], [8], avoiding an inherent redundant parameterization. The
outputs of the network estimate the class conditional posterior probabilities
and the network is trained using a maximum a posteriori (MAP) framework.

The associated risk of overfitting on noisy data is of major concern in neural
network design [4]. The objective of architecture optimization is to minimize
the generalization error. The architecture can be optimized directly by, e.g.,
pruning techniques or indirectly by using regularization. One might consider
various regularization schemes: from adapting a single regularization parameter
to individual regularization of the weights in the net. These subjects are
further addressed in [9], [10]. We suggest a hybrid approach with Optimal Brain
Damage [11] for pruning and an adaptive regularization scheme. The inevitable
problem of adapting the amount of regularization is solved by minimizing the
generalization error w.r.t. regularization parameters. Using the validation
error calculated from a single validation set as an estimate of the
generalization error, it is possible to formulate an iterative gradient descent
scheme for adapting the regularization parameters [9]. The Bayesian way to adapt
regularization parameters is to maximize the evidence [1, Ch. 10], [14];
however, the evidence does not, in a simple way, relate to the generalization
error, which is our primary object of interest.


NETWORK  ARCHITECTURE 

Suppose that the input (feature) vector is denoted by $x$ with
$\dim(x) = n_I$. The aim is to model the posterior probabilities $p(C_i|x)$,
$i = 1, 2, \cdots, c$ where $C_i$ denotes the $i$'th class. Then under a simple
loss function the Bayes optimal^1 classifier assigns class label $C_i$ to $x$ if
$i = \arg\max_j p(C_j|x)$.

Following [8] (see also [1]), the outputs, $\hat{y}_i$, of the neural network
represent estimates of the posterior probabilities, i.e.,
$\hat{y}_i = \hat{p}(C_i|x)$; hence, $\sum_{i=1}^{c} p(C_i|x) = 1$. That is, we
need merely to estimate $c - 1$ posterior probabilities, say $p(C_i|x)$,
$i = 1, 2, \cdots, c - 1$; then the last is calculated as
$p(C_c|x) = 1 - \sum_{i=1}^{c-1} p(C_i|x)$.

Define a 2-layer feed-forward network with $n_I$ inputs, $n_H$ hidden neurons
and $c - 1$ outputs by:

$$ h_j(x) = \tanh\left( \sum_{\ell=1}^{n_I} w^I_{j\ell} x_\ell + w^I_{j0} \right),
   \qquad \phi_i(x) = \sum_{j=1}^{n_H} w^H_{ij} h_j(x) + w^H_{i0} \qquad (1) $$

where $w^I_{j\ell}$, $w^H_{ij}$ are the input-to-hidden and hidden-to-output
weights, respectively. All weights are assembled in the weight vector
$w = \{w^I, w^H\}$.

In order to interpret the network outputs as probabilities a modified normalized
exponential transformation similar to SoftMax [2] is used.

$$ \hat{y}_i = \frac{\exp(\phi_i)}{\sum_{j=1}^{c-1} \exp(\phi_j) + 1},
   \quad i = 1, 2, \cdots, c - 1,
   \qquad \hat{y}_c = 1 - \sum_{i=1}^{c-1} \hat{y}_i. \qquad (2) $$

The modification amounts to fixing $\exp(\phi_c)$ in the standard SoftMax at 1,
eliminating the inherent redundancy of the output weights as also mentioned in
[18, p. 150]. The redundancy implies that a particular set of outputs,
$\hat{y}_i$, $i = 1, 2, \cdots, c$, induces a one-dimensional sub-manifold in
weight space. The network architecture is shown in Fig. 1.
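
A minimal NumPy sketch of the modified SoftMax may help make the normalization concrete. It assumes the $c - 1$ output activations are passed in as the last axis of an array; the function name and the max-subtraction used for numerical stability are choices made for this illustration, not part of the paper.

```python
import numpy as np


def modified_softmax(phi):
    """Map c-1 activations (last axis) to c class probabilities.

    The activation of class c is fixed at 0, i.e. exp(phi_c) = 1, which
    removes the redundant degree of freedom of the standard SoftMax.
    """
    phi = np.asarray(phi, dtype=float)
    zeros = np.zeros(phi.shape[:-1] + (1,))
    z = np.concatenate([phi, zeros], axis=-1)
    z -= z.max(axis=-1, keepdims=True)     # stability only; the ratios are unchanged
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


if __name__ == "__main__":
    print(modified_softmax([0.0, 0.0]))    # three equally likely classes
```

Shifting every activation by the same constant changes nothing in the standard SoftMax; pinning one activation removes exactly that freedom.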


TRAINING  AND  REGULARIZATION 

Assume that we have a training set $T$ of $N_t$ related input-output pairs
$T = \{(x(k), y(k))\}_{k=1}^{N_t}$ where

$$ y_i(k) = \begin{cases} 1 & \text{if } x(k) \in C_i \\
                          0 & \text{otherwise} \end{cases} \qquad (3) $$

^1 That is, each misclassification is equally serious, corresponding to minimal
probability of misclassification.

Figure 1: Neural network architecture.

The likelihood of the network parameters is given by (see e.g., [1], [8]),

$$ p(T|w) = \prod_{k=1}^{N_t} p(y(k)|x(k), w)
          = \prod_{k=1}^{N_t} \prod_{i=1}^{c} \left( \hat{y}_i(k) \right)^{y_i(k)}
   \qquad (4) $$

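where $\hat{y}(k) = \hat{y}(x(k), w)$ is a function of the input and weight
vectors. The training error is the normalized negative log-likelihood

$$ S_T(w) = -\frac{1}{N_t} \log p(T|w)
          = \frac{1}{N_t} \sum_{k=1}^{N_t} \ell(y(k), \hat{y}(k); w) \qquad (5) $$

with $\ell(\cdot)$ denoting the loss given by

$$ \ell(y(k), \hat{y}(k); w)
   = \sum_{i=1}^{c} y_i(k) \log\left( 1 + \sum_{j=1}^{c-1} \exp(\phi_j(x(k))) \right)
     - \sum_{i=1}^{c-1} y_i(k)\, \phi_i(x(k)). \qquad (6) $$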
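
The loss can be evaluated directly from the $c - 1$ activations without first forming the probabilities. The sketch below assumes one-hot targets of length $c$ and a single example; the function names are hypothetical, and the gradient expression follows from differentiating the loss with respect to the activations.

```python
import numpy as np


def loss(phi, y):
    """Negative log-likelihood for one example.

    phi: the c-1 output activations; y: one-hot target vector of length c.
    """
    phi = np.asarray(phi, dtype=float)
    y = np.asarray(y, dtype=float)
    log_z = np.logaddexp.reduce(np.append(phi, 0.0))   # log(1 + sum_j exp(phi_j))
    return float(y.sum() * log_z - y[:-1] @ phi)


def loss_grad(phi, y):
    """Derivative of the loss w.r.t. each phi_i: y_hat_i - y_i for one-hot y."""
    phi = np.asarray(phi, dtype=float)
    y = np.asarray(y, dtype=float)
    log_z = np.logaddexp.reduce(np.append(phi, 0.0))
    y_hat = np.exp(phi - log_z)                        # first c-1 probabilities
    return y.sum() * y_hat - y[:-1]


if __name__ == "__main__":
    phi = [np.log(2.0), np.log(3.0)]
    print(loss(phi, [0, 1, 0]), loss_grad(phi, [0, 1, 0]))
```

Working with the log-sum-exp form avoids overflow for large activations and makes the simple output error (estimated probability minus target) explicit.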

The objective of training is minimization of the regularized cost function^2

$$ C(w) = S_T(w) + R(w, \kappa) \qquad (7) $$

where the regularization term $R(w, \kappa)$ is parameterized by a set of
regularization parameters $\kappa$. Training provides the estimated weight
vector $\hat{w} = \arg\min_w C(w)$ and is done using a Gauss-Newton scheme,

$$ w^{new} = w^{old} - \eta \cdot J^{-1}(w^{old})\, \nabla(w^{old}) \qquad (8) $$

^2 This might be viewed as a maximum a posteriori (MAP) method.

where $\eta$ is the step-size (line search parameter). For that purpose we
require the gradient, $\nabla(w) = \partial C / \partial w$, and the Hessian,
$J(w) = \partial^2 C / \partial w \partial w^T$, of the cost function given by,


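$$ \frac{\partial C}{\partial w}
   = -\frac{1}{N_t} \sum_{k=1}^{N_t} \sum_{i=1}^{c-1}
     [y_i(k) - \hat{y}_i(k)] \frac{\partial \phi_i(x(k))}{\partial w}
     + \frac{\partial R(w, \kappa)}{\partial w}, \qquad (9) $$

$$ J(w) = \frac{1}{N_t} \sum_{k=1}^{N_t} \sum_{i=1}^{c-1} \sum_{j=1}^{c-1}
     \hat{y}_i(k) [\delta_{ij} - \hat{y}_j(k)]
     \frac{\partial \phi_i(x(k))}{\partial w}
     \frac{\partial \phi_j(x(k))}{\partial w^T}
     + \frac{\partial^2 R(w, \kappa)}{\partial w \partial w^T}. \qquad (10) $$

Here $\delta_{ij}$ is the Kronecker delta and we have used the Gauss-Newton
approximation to the Hessian.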


ADAPTING  REGULARIZATION  PARAMETERS 

The available data set, $D$, of $N$ examples is split into two disjoint sets: a
validation set, $V$, with $N_v = \lceil \gamma N \rceil$ examples for
architecture selection and estimation of regularization, and a training set,
$T$, with $N_t = N - N_v$ examples for estimation of network parameters.
$\gamma$ is referred to as the split-ratio. The validation error of the trained
network is given by

$$ S_V(\hat{w}) = \frac{1}{N_v} \sum_{k=1}^{N_v} \ell(y(k), \hat{y}(k); \hat{w})
   \qquad (11) $$

where the sum runs over the $N_v$ validation examples. $S_V(\hat{w})$ is thus an
estimate of the generalization error defined as the expected loss:
$G(\hat{w}) = E_{x,y}\{\ell(y, \hat{y}; \hat{w})\}$, where $E_{x,y}\{\cdot\}$
denotes the expectation w.r.t. the joint input-output distribution.

Aiming at adapting the regularization parameters $\kappa$ so that the validation
error is minimized, we can apply the iterative scheme suggested in [9]:

$$ \kappa^{new} = \kappa^{old}
   - \mu \frac{\partial S_V}{\partial \kappa}(\hat{w}(\kappa^{old})) \qquad (12) $$

where $\mu$ is a step-size and $\hat{w}(\kappa^{old})$ is the estimated weight
vector using the regularization parameter $\kappa^{old}$. Suppose the
regularization term is linear in the regularization parameters, i.e.,

$$ R(w, \kappa) = \kappa^T r(w) = \sum_{i=1}^{q} \kappa_i r_i(w) \qquad (13) $$

where $\kappa_i$ are the regularization parameters and $r_i(w)$ the associated
regularization functions. The gradient of the validation error then equals [9]:


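$$ \frac{\partial S_V}{\partial \kappa}(\hat{w})
   = -\frac{\partial r}{\partial w^T}(\hat{w}) \cdot J^{-1}(\hat{w}) \cdot
     \frac{\partial S_V}{\partial w}(\hat{w}). \qquad (14) $$

Consider the specific case of weight decay regularization with separate weight
decays for input-to-hidden and hidden-to-output layers, i.e.,

$$ R(w, \kappa) = \kappa^I \cdot |w^I|^2 + \kappa^H \cdot |w^H|^2 \qquad (15) $$

where $\kappa = [\kappa^I, \kappa^H]^T$ and $w = [w^I, w^H]$ with
$\dim(w^I) = m_I$, $\dim(w^H) = m_H$ and $\dim(w) = m = m_I + m_H$.

The gradient then yields,

$$ \frac{\partial S_V}{\partial \kappa^I}(\hat{w}) = -2 (\hat{w}^I)^T \cdot g^I,
   \qquad
   \frac{\partial S_V}{\partial \kappa^H}(\hat{w}) = -2 (\hat{w}^H)^T \cdot g^H
   \qquad (16) $$

where $g = [g^I, g^H] = J^{-1}(\hat{w}) \cdot \partial S_V(\hat{w}) / \partial w$.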

In summary, the algorithm for adapting regularization parameters consists of the
following 8 steps:

1. Choose the split ratio $\gamma$ between training and validation set sizes.

2. Initialize $\kappa$ and the weights of the network.

3. Train the network with fixed $\kappa$ to achieve $\hat{w}(\kappa)$. Calculate
the validation error $S_V$.

4. Calculate the gradient $\partial S_V / \partial \kappa$ cf. Eq. (14).
Initialize the step-size $\mu$.

5. Update $\kappa$ using Eq. (12).

6. Retrain the network from the previous weights and calculate the validation
error $S_V$.

7. If no decrease in validation error then perform a bisection of $\mu$ and go
to step 5; otherwise, continue.

8. Repeat steps 4-7 until the relative change in validation error is below a
small percentage or, e.g., the 2-norm of the gradient
$\partial S_V / \partial \kappa$ is below a small number.
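
The eight steps above can be sketched on a deliberately simplified model. The example below swaps the neural classifier for linear least squares with a single weight decay, so that "training" has a closed-form solution and the sketch stays short; this substitution, the squared-error loss, the clamping of the weight decay at zero and all function names are assumptions of the illustration, not part of the paper. The validation gradient uses the weight-decay form of Eq. (16).

```python
import numpy as np


def fit_ridge(X, y, kappa):
    """Minimize C(w) = mean((y - Xw)^2) + kappa * |w|^2 in closed form."""
    n, m = X.shape
    A = X.T @ X / n + kappa * np.eye(m)
    return np.linalg.solve(A, X.T @ y / n)


def val_error(Xv, yv, w):
    """Mean squared error on the validation set."""
    return float(np.mean((yv - Xv @ w) ** 2))


def val_grad_kappa(Xt, Xv, yv, w, kappa):
    """dS_V/dkappa = -2 w^T g with g = J^{-1} dS_V/dw (weight-decay case)."""
    n, m = Xt.shape
    J = 2.0 * Xt.T @ Xt / n + 2.0 * kappa * np.eye(m)   # Hessian of the cost
    dSv_dw = -2.0 * Xv.T @ (yv - Xv @ w) / len(yv)
    g = np.linalg.solve(J, dSv_dw)
    return float(-2.0 * w @ g)


def adapt_kappa(Xt, yt, Xv, yv, kappa=0.0, mu=1.0, tol=1e-8, max_iter=100):
    """Steps 2-8: gradient descent on kappa with bisection of the step-size."""
    w = fit_ridge(Xt, yt, kappa)                         # step 3
    sv = val_error(Xv, yv, w)
    for _ in range(max_iter):
        grad = val_grad_kappa(Xt, Xv, yv, w, kappa)      # step 4
        step = mu
        while step > 1e-12:
            new_kappa = max(kappa - step * grad, 0.0)    # step 5 (clamped, assumed)
            new_w = fit_ridge(Xt, yt, new_kappa)         # step 6
            new_sv = val_error(Xv, yv, new_w)
            if new_sv < sv:
                break
            step /= 2.0                                  # step 7: bisection
        else:
            break                                        # no improving step found
        converged = (sv - new_sv) <= tol * sv            # step 8
        kappa, w, sv = new_kappa, new_w, new_sv
        if converged:
            break
    return kappa, w, sv


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_true = rng.normal(size=8)
    Xt = rng.normal(size=(20, 8)); yt = Xt @ w_true + rng.normal(size=20)
    Xv = rng.normal(size=(20, 8)); yv = Xv @ w_true + rng.normal(size=20)
    print(adapt_kappa(Xt, yt, Xv, yv))
```

For a network, `fit_ridge` would be replaced by retraining from the previous weights and the Hessian by its Gauss-Newton approximation; the control flow of the loop stays the same.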

PRUNING 

In order to reduce and optimize the network architecture we suggest using a
pruning scheme, e.g., Optimal Brain Damage (OBD) [11]. An alternative method is
Optimal Brain Surgeon (OBS) [5]; however, in a series of experiments we noticed
that extreme care is essential in order not to underestimate the saliencies
[16]. Thus OBS is less robust than OBD.

OBD  ranks  the  weights  according  to  importance  or  saliency.  Here  we  use 
the  validation  error  based  OBD  proposed  in  [9].  The  saliency  for  weight  i  is 
given  by 


(17) 


By repeatedly removing weights with small saliencies and retraining the
resulting network, a nested family of network architectures is obtained. The
validation error (or an alternative measure of generalization performance^3) is
then used for selecting the optimal architecture.

EXPERIMENTS 

We test the performance of the adaptive regularization algorithm on a vowel
classification problem. The data are based on the Peterson and Barney database
[17]. The classes are vowel sounds characterized by the first four formant
frequencies. 76 persons (33 male, 28 female and 15 children) have pronounced
c = 10 different vowels (IY IH EH AE AH AA AO UH UW ER) two times. This results
in a database of 1520 examples in total. The database is the verified database
described in [22] where all data^4 are used, including examples where the
utterance failed to be unanimously identified in the listening test (26
listeners). All examples were included to make the task more difficult.

The examples were split into a data set, $D$, consisting of $N = 760$ examples
(16 male, 14 female and 8 children) and an independent test set of the remaining
760 examples. The regularization was adapted by splitting the data set $D$
equally into a validation set of $N_v = 380$ examples and a training set of
$N_t = 380$ examples (8 male, 7 female and 4 children in each set).

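Suppose that the network weights are given by
$w = [w^I, w^I_{bias}, w^H, w^H_{bias}]$ where $w^I$, $w^H$ are input-to-hidden
and hidden-to-output weights, respectively, and the bias weights are assembled
in $w^I_{bias}$ and $w^H_{bias}$. As an example, we use the following weight
decay regularization term:

$$ R(w, \kappa) = \kappa^I \cdot |w^I|^2 + \kappa^I_{bias} \cdot |w^I_{bias}|^2
   + \kappa^H \cdot |w^H|^2 + \kappa^H_{bias} \cdot |w^H_{bias}|^2 \qquad (18) $$

where $\kappa = [\kappa^I, \kappa^I_{bias}, \kappa^H, \kappa^H_{bias}]$. We
further define the normalized weight decays as $\alpha = \kappa \cdot N_t$. The
simulation set-up was: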

•  Network: 4 inputs, 5 hidden neurons, 9 outputs^5.

•  The training input data were normalized to zero mean and unit variance in
order to facilitate training and weight initialization.

•  Weights were initialized uniformly over [−0.5, 0.5]; regularization
parameters were initialized at zero. 10 steps in a gradient descent training
algorithm (see e.g., [12]) were performed and the weight decays, $\kappa$, were
re-initialized at $\lambda_{\max}$/10^, where $\lambda_{\max}$ is the max.
eigenvalue of the Hessian matrix of the cost function. This initialization
scheme is motivated by the following observations:

   -  Weight decays should be so small that they do not reduce the approximation
   capabilities of the network significantly.

   -  They should be so large that the algorithm is prevented from being trapped
   in a local optimum and numerical instabilities are eliminated.

•  Training is now done using a Gauss-Newton algorithm (see e.g., [12]). The
Hessian is inverted using the Moore-Penrose pseudo inverse (see e.g., [19])
ensuring that the eigenvalue spread^6 is less than $10^8$.

•  The regularization step-size $\mu$ is initialized at 1.

•  When the adaptive regularization scheme has terminated we prune 3% of the
weights using a validation set based version of the Optimal Brain Damage recipe
[9], [11].

•  We alternate between pruning and adaptive regularization until the validation
error has reached a minimum.

•  Finally, remaining weights are retrained on all data using the optimized
weight decay parameters.

                         NNet             KNN (K = 9)
   Training              0.105 ± 0.008    0.150
   Validation            0.115 ± 0.005    0.158
   Test                  0.122 ± 0.005    0.199
   Test after retrain.   0.119 ± 0.003    0.153

Table 1: Probability of misclassification, pmc. For the neural network the
averages and standard deviations over 6 runs are reported.

^3 E.g., the previously suggested algebraic estimate [8], [15].

^4 The database can be retrieved from
ftp://eivind.iimn.dtu.dk/dist/data/vowel/PetersonBarney.tar.Z

^5 We only need 9 outputs since the posterior class probability of the 10th
class is given by $1 - \sum_{j=1}^{9} p(C_j|x)$.

Table 1 reports the average and standard deviations of the probability of
misclassification (pmc) over 6 runs for pruned networks using the optimal
regularization parameters. Note that retraining on the full data set decreases
the test pmc slightly on the average; improvement was found in 4 out of 6 runs.
For comparison we used a K-nearest-neighbor (KNN) classification, see e.g., [1],
and found that K = 9 was optimal on the validation set. Note that the neural
network performed significantly better. Contrasting the obtained results to
other work is difficult. In [20] results on the Peterson-Barney vowel problem
are reported, but their data are not exactly the same; only the first 2 formant
frequencies were used. Furthermore, different test sets have been used for the
different methods presented. The best result reported [13] is obtained by using
KNN and reaches pmc = 0.186, which is somewhat higher than our results.

In  Fig.  2  the  evolution  of  the  adaptive  regularization  as  well  as  the  pruning 
algorithm  is  demonstrated. 

CONCLUSIONS 

This paper presented a framework for design of neural classifiers which includes
architecture optimization by pruning and adaptation of regularization
parameters. Moreover, an improved neural net architecture was presented.
Numerical examples demonstrated the potential of the framework.

^6 Eigenvalue spread should not be larger than the square root of the machine
precision [3].

Figure 2: Panels (a), (b) and (c) show the evolution of the adaptive
regularization algorithm in a typical run. Optimal weight decays are found by
minimizing the validation error in (a). Note that the test error also decreases.
This tendency is also evident in (b) displaying pmc, even though a small
increase is noticed. In (c) the normalized weight decays,
$\alpha = \kappa \cdot N_t$, are depicted. (d) and (e) show the evolution of
errors and pmc during pruning. The optimal network having minimal validation
error is indicated by the vertical line. There is only a marginal effect of
pruning. Finally, the variation of the optimal normalized weight decays (before
pruning) in different runs is shown in (f) and is seen to be relatively small.

Abstract

There  has  been  much  interest  in  learning  long-term  temporal  dependencies 
with  neural  networks.  Adequately  learning  such  long-term  information  can  be 
useful  in  many  problems  in  signal  processing,  control  and  prediction. 

A class of recurrent neural networks (RNNs), NARX neural networks, was
shown to perform much better than other recurrent neural networks when learning
simple long-term dependency problems. The intuitive explanation is that
the output memories of a NARX network can be manifested as jump-ahead
connections in the time-unfolded network.

Here we show that similar improvements in learning long-term dependencies
can be achieved with other classes of recurrent neural network architectures
simply by increasing the order of the embedded memory. Experiments with
locally recurrent networks and NARX (output feedback) networks show that all
of these classes of network architectures can improve significantly
at learning long-term dependencies as the order of embedded memory is
increased, other things being held constant. These results can be important to a user
comfortable with a specific recurrent neural network architecture because simply
increasing the embedding memory order of that architecture will make it
more robust to the problem of long-term dependency learning.


1  Introduction 

Recurrent Neural Networks (RNNs), though capable of representing arbitrary nonlinear
dynamical systems [25] and computationally quite powerful [26], can sometimes
have difficulty learning even simple temporal behavior. Part of this difficulty
has been attributed to the problem of long-term dependencies [2, 19], i.e. those problems
for which the desired output of a system at time T depends on inputs presented
at times t ≪ T.

In particular, Bengio et al. [2] showed that if a system is to latch information robustly,
then the fraction of the gradient in a gradient-based training algorithm due to
information n time steps in the past approaches zero as n becomes large. This effect
is called the problem of vanishing gradients. Bengio et al. claimed that the problem
of vanishing gradients is the essential reason why gradient-descent methods are not
sufficiently powerful to learn long-term dependencies.
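The decay described above can be illustrated with a minimal sketch. It assumes a single tanh unit with a hypothetical self-weight `w`, which is an illustrative toy and not the system analysed in [2]. By the chain rule, the derivative of the state after n steps with respect to the initial state is a product of n factors w(1 - x(t)^2), and once the unit has latched near a saturated value each factor is well below one.

```python
import math

def state_gradient(w, x0, n):
    """Return d x(n) / d x(0) for the scalar recurrence x(t) = tanh(w * x(t-1)).

    The derivative is the product of the per-step factors w * (1 - x(t)**2).
    """
    x, grad = x0, 1.0
    for _ in range(n):
        x = math.tanh(w * x)
        grad *= w * (1.0 - x * x)
    return grad

# Hypothetical values: with w = 2.0 the unit latches near +/-0.96.
gradients = {n: state_gradient(2.0, 0.5, n) for n in (1, 5, 10, 20, 40)}
```

Inspecting `gradients` shows the contribution of the initial state shrinking geometrically with the number of steps, which is the behaviour the paragraph above attributes to robust latching.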

Several approaches have been suggested to circumvent the problem of vanishing
gradients in training RNNs: presetting initial weights by using prior knowledge [6,
9], alternative optimization methods instead of gradient-based ones [2], reduced description
of data [19, 23, 24], architectures that operate on multiple time scales [10, 11],
and architectures with high-order gating units [12].

A class of recurrent neural networks called NARX networks can perform much
better at learning long-term dependencies when using a gradient descent training
algorithm [17]. The intuitive explanation for this behavior is that the output memories
of a NARX neural network are manifested as jump-ahead connections in the
time-unfolded network that is often associated with algorithms such as Backpropagation
Through Time (BPTT). These jump-ahead connections provide shorter paths for
propagating gradient information, thus reducing the sensitivity of the network to
long-term dependencies.
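The "shorter paths" argument can be made concrete with a small counting sketch. It assumes, purely for illustration, that a memory of order D lets a gradient signal jump back by up to D time steps per hop in the unfolded network; the function name is hypothetical.

```python
def shortest_gradient_path(span, memory_order):
    """Fewest hops in the unfolded network needed to cover `span` time steps
    when each hop may jump back by at most `memory_order` steps."""
    if span < 0 or memory_order < 1:
        raise ValueError("span must be >= 0 and memory_order >= 1")
    return -(-span // memory_order)  # integer ceiling of span / memory_order
```

For a dependency spanning 60 steps, a memory of order 1 needs 60 hops while a memory of order 6 needs 10, so the gradient passes through fewer nonlinear stages.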

We hypothesize that a similar improvement in learning long-term dependencies
can be achieved in other classes of recurrent neural network architectures by
increasing the order of embedded memory. (One of the first uses of embedded
memory in recurrent network architectures was that of Jordan [14].) In this paper,
we empirically justify this hypothesis by showing the relationship between the memory
order of an RNN and its sensitivity to long-term dependencies. In Section 2,
we discuss three classes of conventional recurrent neural network architectures:
globally recurrent networks (the architecture, not the training procedure, used by
Elman) [5]; locally recurrent networks (in particular the model of Frasconi, Gori and Soda)
[7]; and NARX networks [3, 21], together with their corresponding models with a high
order embedded memory. In Section 3, we provide an empirical comparison of these
architectures by investigating their performance on learning two simple long-term
dependency problems: the latching problem and a grammatical inference problem.
These simulations show that these classes of recurrent neural network architectures
all demonstrate significant improvement in learning long-term dependencies when
the embedded memory order is increased while the number of weights remains
approximately the same. Thus, a user of one of these recurrent architectures can readily
improve its robustness to long-term memory problems simply by increasing the amount
of embedded memory, all other variables remaining constant.
2  Embedding memory order in recurrent neural network architectures

Several recurrent neural network architectures have been proposed; for a collection
of papers on the variety see [8]. One taxonomic classification for these architectures
can be based on the observability of their states: specifically they can be broadly
divided into two groups depending on whether or not the states of the network are
observable [13]. For another taxonomic approach based on memory types, see
Mozer [20]. For this study we picked three classes of networks: globally recurrent
(GR) networks [5], locally recurrent (LR) networks [7], and NARX networks [3,
21]; and their corresponding architectures with high-order embedded memory. It
should be pointed out that our embedded memory simply consists of tapped
delayed values fed to various neurons and not more sophisticated embedded memory
structures [20, 4]. NARX networks are a typical model of networks with observable
states. GR networks are a popular class of network with globally connected hidden
states, and LR networks belong to the locally recurrent network architecture class, also
with hidden states.

2.1  Globally  connected  RNNs 

These networks (which we will call GR networks) are a class of recurrent networks
in which the feedback connections come from the state vector to the hidden layer,
as illustrated in Figure 1 (a). These hidden states are sometimes called context units
in the literature. For such a network with $n_u$ input nodes, $n_h$ hidden nodes,
and $n_y$ output nodes, the dynamic equations can be described by:

$$o_i(t) = f\Big( \sum_{j=1}^{n_h} w_{ij}\, o_j(t-1) + \sum_{l=1}^{n_u} w^{u}_{il}\, u_l(t) + w^{b}_{i} \Big), \qquad (1)$$

$$y_i(t) = f\Big( \sum_{j=1}^{n_h} w^{y}_{ij}\, o_j(t) + w^{c}_{i} \Big), \qquad (2)$$

where $o(t)$ and $y(t)$ denote the real-valued outputs of the hidden and output neurons
at time $t$, the $w$ terms are weights and biases, and $f$ is the nonlinear function.

A GR network with a high order of embedded memory differs from the standard globally
connected recurrent network in that it has more than one state vector per
feedback loop. Specifically, for a GR network with embedded memory of order $m$, the
dynamic equations of the hidden nodes become:

$$o_i(t) = f\Big( \sum_{k=1}^{m} \sum_{j=1}^{n_h} w^{k}_{ij}\, o_j(t-k) + \sum_{l=1}^{n_u} w^{u}_{il}\, u_l(t) + w^{b}_{i} \Big). \qquad (3)$$

Figure 1 (b) illustrates a GR network with embedded memory of order two.
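A minimal sketch of the hidden-layer update of a GR network with embedded memory of order m may help fix ideas. It assumes a tapped delay line of m past state vectors, tanh units, and zero initial states; the argument names `W`, `V` and `b` are illustrative and not the paper's notation.

```python
import math

def gr_hidden_step(past_states, u, W, V, b):
    """One hidden-layer update of a GR network with embedded memory of order m.

    past_states: list of m state vectors [o(t-1), o(t-2), ..., o(t-m)]
    u:           current input vector u(t)
    W:           W[k][i][j] weights state j at delay k+1 into hidden node i
    V:           V[i][l] weights input l into hidden node i
    b:           b[i] is the bias of hidden node i
    """
    new_state = []
    for i in range(len(b)):
        net = b[i]
        for k, o_delayed in enumerate(past_states):
            net += sum(W[k][i][j] * o_j for j, o_j in enumerate(o_delayed))
        net += sum(V[i][l] * u_l for l, u_l in enumerate(u))
        new_state.append(math.tanh(net))
    return new_state

def run_gr(inputs, W, V, b):
    """Run the hidden layer over an input sequence, starting from zero states."""
    m, n_h = len(W), len(b)
    past = [[0.0] * n_h for _ in range(m)]
    for u in inputs:
        o = gr_hidden_step(past, u, W, V, b)
        past = [o] + past[:-1]  # shift the tapped delay line by one step
    return past[0]
```

With m = 1 the sketch reduces to the standard GR update; raising m only lengthens the delay line that feeds each hidden node.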

2.2  Locally  recurrent  networks 

In  this  class  of  networks,  the  feedback  connections  are  only  allowed  from  neurons  to 
themselves,  and  the  nodes  are  connected  together  in  a  feed  forward  architecture  [1, 
Figure 1: (a) A standard GR network. (b) A GR network with embedded memory
of order two.


15, 7, 22, 29]. Specifically, we consider networks proposed by Frasconi et al. [7]
(which we will call LR networks), as shown in Figure 2 (a). The dynamic neurons of LR networks
can be described by

$$o_i(t) = f\Big( w_{ii}\, o_i(t-1) + \sum_{j=1}^{n_u} w_{ij}\, u_j(t) + w^{b}_{i} \Big), \qquad (4)$$

where $o_i(t)$ denotes the output of the $i$th node at time $t$, and $f$ is the nonlinearity. For
a network with embedded memory of order $m$, the output of the dynamic neurons
becomes

$$o_i(t) = f\Big( \sum_{n=1}^{m} w^{n}_{ii}\, o_i(t-n) + \sum_{j=1}^{n_u} w_{ij}\, u_j(t) + w^{b}_{i} \Big). \qquad (5)$$

Figure 2 (b) shows an LR network with embedded memory of order two. Locally
recurrent models usually differ in where and how much output feedback is permitted;
see [29] for a discussion of architectural differences.

2.3  NARX  recurrent  neural  networks 

An important class of discrete-time nonlinear systems is the Nonlinear AutoRegressive
with eXogenous inputs (NARX) model [3, 18, 27, 28]:

$$y(t) = f\big( u(t - D_u), \ldots, u(t), y(t - D_y), \ldots, y(t-1) \big), \qquad (6)$$

where $u(t)$ and $y(t)$ represent the input and output of the network at time $t$, $D_u$ and $D_y$
are the input-memory and output-memory orders, and $f$ is a nonlinear
function. When the function $f$ can be approximated by a Multilayer Perceptron, the
resulting system is called a NARX recurrent neural network [3, 21].
Figure 2: (a) A standard LR network. (b) An LR network with embedded memory of
order two.


In this paper, we shall consider NARX networks with zero input order. Thus, the
operation of the network is defined by

$$y(t) = f\big( u(t), y(t - D_y), \ldots, y(t-1) \big). \qquad (7)$$

Figure 3 shows a NARX architecture with output memory of order 3.
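The zero-input-order NARX operation can be sketched as follows. The sketch assumes a single output, one tanh hidden layer, a tanh output node, and zero initial output memory; the function and argument names are illustrative rather than taken from the paper.

```python
import math

def narx_run(inputs, hidden_w, hidden_b, out_w, out_b, order):
    """Run a single-output NARX network with zero input order.

    Each hidden node sees [u(t), y(t-1), ..., y(t-order)].
    hidden_w[i] holds len(u) + order weights; out_w holds one per hidden node.
    """
    y_delays = [0.0] * order             # y(t-1), ..., y(t-order)
    outputs = []
    for u in inputs:
        x = list(u) + y_delays
        hidden = [math.tanh(sum(w * v for w, v in zip(w_i, x)) + b_i)
                  for w_i, b_i in zip(hidden_w, hidden_b)]
        y = math.tanh(sum(w * h for w, h in zip(out_w, hidden)) + out_b)
        outputs.append(y)
        y_delays = [y] + y_delays[:-1]   # shift the output delay line
    return outputs
```

Because the fed-back quantities are the outputs themselves, the state of this network is observable, in contrast to the hidden states of the GR and LR networks.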

3  Experimental  Results 

Simulations were performed to explore the effect of embedded memory on learning
long-term dependencies in these three different recurrent network architectures.
The long-term dependency problems investigated were the latching problem and a
grammatical inference problem. These problems were chosen because they are simple
and should be easy to learn but exemplify the long-term dependency issue. For
more complex problems involving long-term dependencies see [12].

In order to establish some metric for comparison of the experimental results, we
gave the recurrent networks sufficient resources (number of weights and training examples,
adequate training time) to readily solve the problem but held the number
of weights approximately invariant across all architectures. Also note that in some
cases the order of the embedded memory is the same.

3.1  The latching problem

This experiment evaluates the performance of different recurrent network architectures
with various orders of embedded memory on a problem already used for studying
the difficulty in learning long-term dependencies [2, 11, 17].

This problem is a minimal task designed as a test that must necessarily be passed
in order for a network to robustly latch information [2]. In this two-class problem,
the class of a sequence depends only on the first 3 time steps; the remaining values
Figure 3: A NARX network with output memory of order 3.


                       Network Description
Architecture   Memory order   # states   # hidden neurons   In-hid-out   # weights
GR(1)          1              6          6 nodes            3-6-1        85
GR(2)          2              10         5 nodes            3-5-1        91
GR(3)          3              12         4 nodes            3-4-1        81
NARX(2)        2              2          11 nodes           3-11-1       111
NARX(4)        4              4          8 nodes            3-8-1        97
NARX(6)        6              6          6 nodes            3-6-1        85
LR(1)          1              14         14 nodes
LR(2)          2              22         11 nodes           3-11-1
LR(3)          3              27         9 nodes            3-9-1        111

Table 1: Architecture description of the different recurrent networks used for the
latching problem. We used the hyperbolic tangent function as the nonlinear function for
each neuron.
in the sequence are uniform noise. There are three inputs, $u_1(t)$, $u_2(t)$, and a noise
input $e(t)$. Both $u_1(t)$ and $u_2(t)$ are zero for all times $t > 1$. At time $t = 1$,
$u_1(1) = 1$ and $u_2(1) = 0$ for samples from class 1, and $u_1(1) = 0$ and $u_2(1) = 1$
for samples from class 2. The class information of each string is contained in $u_1(t)$
and $u_2(t)$. We used two delay elements for both $u_1(t)$ and $u_2(t)$ in order to hold the
class information until $t = 3$. The noise input $e(t)$ is given by

$$e(t) = \begin{cases} 0, & t \le 3 \\ U(-b, b), & 3 < t \le T \end{cases} \qquad (8)$$

where $U(-b, b)$ are samples drawn uniformly from [-0.155, 0.155]. Target information
was only provided at the end of each sequence. For comparison, our training
particulars are identical to those of [2]. For strings from class one, a target value of
0.8 was chosen; for class two, -0.8 was chosen. The length of the noisy sequence
could be varied in order to control the span of long-term dependencies. For our experiment,
the input sequences were 1 and 0 and were one-hot encoded into two input
neurons with trainable weights.
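The data for the latching problem can be generated with a few lines of code. The sketch below assumes that the noise is zero for t ≤ 3 and uniform on [-b, b] afterwards, and it returns the raw inputs without the two delay elements mentioned above; the function names are illustrative.

```python
import random

def latching_sequence(cls, T, b=0.155, rng=random):
    """Return a list of (u1, u2, e) triples for t = 1..T.

    cls is 1 or 2. The class is marked only at t = 1; noise starts at t = 4.
    """
    if cls not in (1, 2):
        raise ValueError("cls must be 1 or 2")
    seq = []
    for t in range(1, T + 1):
        u1 = 1.0 if (t == 1 and cls == 1) else 0.0
        u2 = 1.0 if (t == 1 and cls == 2) else 0.0
        e = 0.0 if t <= 3 else rng.uniform(-b, b)
        seq.append((u1, u2, e))
    return seq

def target(cls):
    """Target presented at the end of a sequence: 0.8 (class 1), -0.8 (class 2)."""
    return 0.8 if cls == 1 else -0.8
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when comparing architectures on identical strings.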

For each of the three architectures previously discussed, several networks with
different orders of embedded memory were trained. To compare the effects of different
orders of embedded memory in every class of networks on learning long-term
dependencies while holding as many other factors as possible constant, particular
attention was paid to equalizing the number of weights. Table 1 gives a detailed description
of all networks used in the latching problem. The weight connected to the
noisy input was fixed at 1.0. In order to learn the task, the networks have to develop
two attractors to latch the information and must remain inside the basins of those
attractors, i.e. be resistant to noise, when t > 3. The ability to learn this minimal
problem is a measure of the effectiveness of propagating the gradient for different
neural network architectures with various memory orders.
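The text does not state how the weights in Table 1 were counted. One bookkeeping that reproduces the GR and NARX entries is sketched below; it assumes six trainable input connections per hidden node (u1, u2 and their two delayed copies, with the fixed noise weight excluded), one bias per node, and a single output node. Both the function and these assumptions are illustrative.

```python
def count_weights(memory_order, n_hidden, n_inputs=6, feedback="state"):
    """Count trainable weights under one assumed bookkeeping.

    feedback="state":  every hidden node sees memory_order * n_hidden
                       delayed hidden states (GR networks).
    feedback="output": every hidden node sees memory_order delayed
                       copies of the single output (NARX networks).
    """
    if feedback == "state":
        n_feedback = memory_order * n_hidden
    elif feedback == "output":
        n_feedback = memory_order
    else:
        raise ValueError("feedback must be 'state' or 'output'")
    hidden = n_hidden * (n_feedback + n_inputs + 1)  # +1 for the bias
    output = n_hidden + 1                            # output node plus its bias
    return hidden + output
```

Under this convention GR(1) and NARX(6), both with 6 hidden nodes, come out at the same count, which matches the way the text later pairs these two networks.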

The length of noisy inputs, T, was varied from 10 to 60 in increments of 2.
For each value of T, we ran 50 simulations. For each simulation, 30 strings were
generated from each class and the initial weights were randomly distributed in the
range [-0.5, 0.5].

The network was trained with an MSE cost function using a simple BPTT algorithm
with a learning rate of 0.1 for a maximum of 200 epochs. Updates occurred at the
end of each string and the error was back-propagated the full length of the string. If
the absolute error between the output of the network and the target value was less
than 0.6 on all strings, the simulation was terminated and deemed successful. If
the simulation exceeded 200 epochs and did not correctly classify all strings, then
the simulation was ruled a failure.
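The stopping rule just described can be written down directly. In the sketch below, `train_epoch` and `evaluate` are hypothetical caller-supplied callables standing in for one BPTT pass and for collecting the final outputs on all strings; only the success test and the epoch limit come from the text.

```python
def run_succeeded(outputs, targets, tolerance=0.6):
    """True if every final output is within `tolerance` of its target."""
    return all(abs(y - d) < tolerance for y, d in zip(outputs, targets))

def train_until_success(train_epoch, evaluate, targets, max_epochs=200):
    """Call train_epoch() up to max_epochs times and stop early on success.

    evaluate() must return the network's final outputs for all strings.
    Returns (succeeded, epochs_used).
    """
    for epoch in range(1, max_epochs + 1):
        train_epoch()
        if run_succeeded(evaluate(), targets):
            return True, epoch
    return False, max_epochs
```

With targets of ±0.8, a tolerance of 0.6 means an output only needs the correct sign and a magnitude above 0.2 to count as correct.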

Figures 4 (a) to (c) show plots of the percentage of runs that were successful
for different classes of networks with different orders of embedded memory.
It is clear from these plots that the network architectures with high order embedded
memory become increasingly less sensitive to long-term dependencies as the
memory order is increased.
Figure 4: Plots of percentage of successful simulations on the latching problem from
50 runs as a function of T, the length of input strings, for different recurrent network
architectures with different orders of embedded memory: (a) Globally connected
RNN (GR), (b) Locally connected RNN (LR), (c) NARX, (d) NARX vs. GR(1).


An interesting comparison between the architectures GR(1) and NARX(6) is
shown in Figure 4 (d). Since the two architectures have exactly the same number
of weights, hidden nodes, and states, the only difference is the memory
order. NARX networks perform better than the GR networks at learning the latching
problem.

3.2  Grammatical Inference (Tree Automata) Problem

In the previous problem, the inputs to the network were followed by a noise term. In
this experiment, we consider learning to classify strings of boolean values, which
are labelled according to some prespecified automaton.

In  this  example,  the  class  of  a  string  is  completely  determined  by  its  input  symbol 
at  some  prespecified  time  t.  For  instance,  Figure  5  shows  a  five-state  automaton  used 
in  the  experiments,  in  which  the  class  of  each  string  is  determined  by  the  third  input 
symbol.  When  that  symbol  is  “1”,  the  string  is  accepted;  otherwise,  it  is  rejected. 
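The labelling rule of the example can be sketched as a small transition table. The state numbering and transition layout below are an assumed reading of a five-state automaton whose decision depends on the third symbol, not a transcription of Figure 5.

```python
# States: 0 = start, 1 = one symbol read, 2 = two symbols read,
#         3 = accept (absorbing), 4 = reject (absorbing).
TRANSITIONS = {
    0: {"0": 1, "1": 1},
    1: {"0": 2, "1": 2},
    2: {"0": 4, "1": 3},
    3: {"0": 3, "1": 3},
    4: {"0": 4, "1": 4},
}
ACCEPTING = {3}

def accepts(string):
    """Run the automaton over a string of '0'/'1' symbols."""
    state = 0
    for symbol in string:
        state = TRANSITIONS[state][symbol]
    return state in ACCEPTING
```

Every symbol after the third leaves the state unchanged, so lengthening the strings increases the distance between the decisive input and the point where the label is read off.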
Figure 5: A five-state tree automaton. The unlabeled arrow indicates the start state and the
double circled state is the acceptance state.

By increasing the length of the strings to be learned, we will be able to control the
span of long-term dependencies, in which the output will depend on input values far
in the past.

Again, we noted the same improvement in learning long-term dependencies obtained
by increasing the order of embedded memory in each class of recurrent neural
network architectures. For more details regarding the experiment, please see [16].

4  Conclusion 

Motivated by the analysis of the problem of learning long-term dependencies and
the success of NARX networks on problems including grammatical inference and
nonlinear system identification [13], we explore the ability of other recurrent neural
networks with a high order of embedded memory on problems that involve long-term
dependencies. We chose three classes of recurrent neural network architectures
based on state observability: hidden state globally recurrent and locally recurrent
networks, and observable state NARX networks.

We tested this approach of extending memory in conventional recurrent neural
networks on two simple long-term dependency problems. Our experimental results
show that each of these classes of recurrent neural network architectures can
demonstrate significant improvement in learning long-term dependencies when the
memory order of the network is increased.

The intuitive explanation for this behavior is that the embedded memories are
manifested as jump-ahead connections in the unfolded network that is often used to
describe algorithms like Backpropagation Through Time. These jump-ahead connections
provide a shorter path for propagating gradient information, thus reducing
the sensitivity of the network to long-term dependencies. Another explanation is
that the states do not necessarily need to propagate through nonlinearities at every
time step, which may avoid a degradation in gradient due to the partial derivative
of the nonlinearity. We speculate that using increased memory order will also help
other recurrent network architectures on learning long-term dependency problems.
Abstract

In  current  practice,  tapped  delay  line  models  such  as  the  time  delay 
neural  network  (TDNN)  are  commonly  implemented  using  a  direct  form 
structure.  In  this  paper,  we  show  that  the  problem  of  high  parameter 
sensitivity,  well  known  in  linear  systems,  also  applies  to  nonlinear  models 
such  as  the  TDNN.  To  overcome  the  consequent  numerical  problems,  we 
propose  a  cascade  form  TDNN  (CTDNN)  and  show  its  advantages  over 
the  commonly  used  direct  form  TDNN. 


1  Introduction 

In signal processing and control applications, there is much interest in nonlinear
adaptive filters based on neural networks. The tapped delay line has been
employed in [10] to modify the classic multilayer perceptron (MLP) for signal
processing. It is easily observed¹ that the tapped delay line can be considered
as a finite impulse response (FIR) filter [4, 18]:

$$y(t) = \sum_{i=0}^{m} b_i q^{-i} x(t) \qquad (1)$$

where $b_i$ are constants and $q^{-1} x(t) = x(t-1)$. Eqn (1) can be equivalently
written as

$$y(t) = b_0 \prod_{i=1}^{m} (1 - c_i q^{-1})\, x(t) \qquad (2)$$

If $c_i$ is a complex number, then it will appear with its complex conjugate for
real $y(t)$.

¹ Note that we use notation which includes both time and the q-operator, as is standard
practice in the literature.

This model has been proven to be capable of universally approximating a
functional [7, 16]. However, for real world applications, there are a number of
other aspects which also need to be considered when implementing cost effective
models in hardware. One area normally considered is the round-off error and
quantization error due to finite arithmetic wordlengths.

It is known in the digital signal processing literature [14, 19] that when
FIR and IIR (infinite impulse response) filters are realized using devices capable
of only finite precision arithmetic, (1) is more sensitive to round-off effects
and coefficient quantization than (2). Eqn (1) is known as a direct form implementation,
while (2) is known as a cascade form [19].
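The two realizations can be compared with a short sketch. It assumes a monic filter built from first-order sections of the form (1 - c_i q^-1) and a zero initial state; the function names are illustrative. In exact arithmetic both forms give the same output, so any difference between them arises only once the coefficients or intermediate results are rounded.

```python
def direct_from_zeros(zeros):
    """Expand prod_i (1 - c_i q^-1) into direct-form coefficients b_0..b_m."""
    coeffs = [1.0 + 0j]
    for c in zeros:
        shifted = [0j] + [-c * a for a in coeffs]   # the -c_i q^-1 term
        coeffs = [a + s for a, s in zip(coeffs + [0j], shifted)]
    return coeffs

def fir_direct(coeffs, x):
    """y(t) = sum_i b_i x(t-i), with x(t) = 0 for t < 0."""
    return [sum(b * x[t - i] for i, b in enumerate(coeffs) if t - i >= 0)
            for t in range(len(x))]

def fir_cascade(zeros, x):
    """Apply the first-order sections (1 - c_i q^-1) one after another."""
    y = list(x)
    for c in zeros:
        y = [y[t] - c * (y[t - 1] if t > 0 else 0.0) for t in range(len(y))]
    return y
```

A complex zero is passed together with its conjugate, so that the expanded coefficients and the output are real up to rounding.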

While the sensitivity properties of linear filters are well known, it
appears that this area has not, in general, been considered in neural networks. It appears
to be quite common for experiments to be conducted using time delay neural
networks with a direct form structure without considering potential problems of
numerical effects such as parameter sensitivity. Hence, in this paper we investigate
the sensitivity of TDNN networks which are based on (1) and propose a
new class of TDNN architecture based on the cascade form model. The main
aim of the paper is therefore to demonstrate the validity and usefulness of
using cascade form structures in time delay neural networks.


2  A  Cascade  Form  Time  Delay  Neural  Network 

As implied in (1), we may equivalently formulate the input layer of the TDNN²
as FIR filters $G_{0j}(q)$ from the input to the $j$th unit in the first hidden layer.

This approach allows us to easily consider various extensions to the basic
TDNN structure. Hence we may have

$$G_0(q) = \sum_{i=0}^{n_b} b_i q^{-i} \qquad (3)$$

² For convenience we consider a single-input single-output model, with a tapped delay line
occurring only at the input layer, though further extensions could be derived.
Eqn (3) may be written equivalently as a cascade form structure as follows

$$G_0(q) = \prod_{j=1}^{M} B_{0j}(q) \qquad (4)$$

where typically, $\{B_{0j}(q)\}$ is a first or second order section of the form $(1 - c_i q^{-1})$,
where $c_i$ may be real or complex; if $c_i$ is complex, then it must occur in a
complex conjugate pair for a real output signal. This structure is a specialized
form of the TDNN, which we term a cascade form time delay neural network
(CTDNN). Hence we term the model represented by (3) a direct form TDNN
(DTDNN). The results can be applied equally well to the full TDNN structure,
but for clarity we consider this simpler model form.

3  Parameter  Sensitivity  Analysis  of  Time  Delay 
Neural  Networks 

In neural networks, the issue of sensitivity to errors in the weights has typically
been approached in a probabilistic framework [2, 8, 9, 17]. Here, the approach
used is to extend the usual method of parameter sensitivity analysis used in
linear systems, based on the model poles and zeros, to nonlinear systems. Although
the use of poles and zeros in nonlinear systems may be thought by some
to be inappropriate, this is not the case as shown in [5, 6] (see also [15]).

The method presented in [6] is based on approximating a particular class of
nonlinear system described by

$$y(t) = F(G(q)x(t)) \qquad (5)$$

where $F(\cdot)$ is a memoryless nonlinearity. For $G(q) = B_0(q)$,
(5) is a special case of the TDNN. If $G(q)$ is a SIMO (single-input multiple-output)
structure, i.e. a parallel filter bank, and $F(\cdot)$ is a MISO nonlinear map,
then the more general form of TDNN is obtained. For clarity of presentation,
but without loss of generality, we consider the case where $G(q) = B_0(q)$ and
$F(\cdot)$ is SISO.

We may approximate $F(\cdot)$ by a power series expansion, giving

n 

y{t)  =  Fgijxjt))  T  ‘-F gpjx^jt)) 

k=0 

-\-0  -  l)...a?*”(^  -  n))  +  (6) 

where  {^}  is  the  set  of  parameters  obtained  from  the  approximation  of  F(-) 
and 

gi[x^[t))  =  ^j{q)xHt)  (7) 
(8)


k=0 

Hence the TDNN model can be approximated by a summation of subsystems
which employ linear transfer functions as shown in (8). The parameter
sensitivity measure $S_{ij}$ normally used for linear systems [12] can be applied to
the subsystems which form the approximate TDNN model and consequently
to the TDNN. The parameter sensitivity measure is defined as

$$S_{ij} = \frac{\partial \lambda_i}{\partial b_j} = \frac{\lambda_i^{\,n-j}}{\prod_{k \neq i} (\lambda_i - \lambda_k)} \qquad (9)$$

where $S_{ij}$ is the sensitivity of the $i$th root $\lambda_i$ with respect to the parameter $b_j$ of
the polynomial $B(q)$. A high parameter sensitivity implies that a small change
in a parameter will lead to a large change in the model behaviour, as determined
by the roots of the polynomial of the FIR filter [1, 3]. Hence problems of round-off
error and coefficient quantization may become significant. Errors can likewise
be introduced into adaptive models when the weights are updated but the resulting
model behaviour differs from that required by the update.
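The effect of root spacing on sensitivity can be checked numerically. The sketch below assumes the standard root-sensitivity expression for a monic polynomial, whose magnitude is |λ_i|^(n-j) divided by the product of the distances from λ_i to the other roots; the sign depends on the coefficient convention and is ignored here. The function names are illustrative.

```python
def root_sensitivities(roots):
    """Return S[i][j-1] = |lambda_i|**(n-j) / prod_{k != i} |lambda_i - lambda_k|
    for j = 1..n, i.e. the magnitude of d(lambda_i)/d(b_j)."""
    n = len(roots)
    table = []
    for i, lam in enumerate(roots):
        denom = 1.0
        for k, other in enumerate(roots):
            if k != i:
                denom *= abs(lam - other)
        table.append([abs(lam) ** (n - j) / denom for j in range(1, n + 1)])
    return table

def worst_case(roots):
    """Largest sensitivity magnitude over all roots and coefficients."""
    return max(max(row) for row in root_sensitivities(roots))
```

Comparing, for example, four real roots clustered between 0.90 and 0.96 with four roots of the same magnitude spread around the circle shows the clustered set to be several orders of magnitude more sensitive, because the product of small root differences appears in the denominator.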

Therefore  the  following  observations  can  be  made: 

Observation 1. A DTDNN model is subject to possible problems of high
parameter sensitivity.

Observation 2. If a method can be given which reduces the sensitivity of a
linear model, then the same technique will reduce the parameter sensitivity
of the subsystems within the approximate TDNN model and hence
of the TDNN model itself.

We may define the parameter sensitivity of a DTDNN in terms of the sensitivities
of the subsystems within the approximate TDNN model.

Definition 1 The sensitivity $S_N$ of a DTDNN with $M$ tapped delay inputs and
$N_0$ input units is given by

$$S_N = \max_i \,( S_i : i = 1, \ldots, N_0 ) \qquad (10)$$

$$S_i = \max_j \,( S_{ij} : j = 1, \ldots, M ) \qquad (11)$$

Definition 2 The sensitivity $S_{Nc}$ of a CTDNN with $(M+1)/2$ ($M$ odd) or
$(M+2)/2$ ($M$ even) first or second order filter sections going to each of $N_0$
input units is given by

$$S_{Nc} = \max_i \,( S_{ci} : i = 1, \ldots, N_0 ) \qquad (12)$$

$$S_{ci} = \max_j \,\big( S_{cij} : j = 1, \ldots, (M+1)/2 \ (M \text{ odd}),\ (M+2)/2 \ (M \text{ even}) \big) \qquad (13)$$

Note that $S_{ij}$ is the sensitivity of an $M$th order direct form filter, while $S_{cij}$
is the sensitivity of a 2nd order filter. Hence we obtain the following theorem.

Theorem 1 The sensitivity $S_{Nc}$ of a cascade form TDNN is less than or equal
to the sensitivity $S_N$ of a direct form TDNN.

Proof. From (9), we have, for $j = n$, the last stage in an $n$ tap delay line:

$$S_{in} = \frac{1}{\prod_{k \neq i} (\lambda_i - \lambda_k)} \qquad (14)$$

It  is  sufficient  to  consider  two  cases: 


1. $|\lambda_i - \lambda_k| \gg 0$. In this case, $S_{ij}$ will be small, which implies that the difference
between the sensitivities for cascade form and direct form models
will be of negligible consequence.

2. $0 \approx |\lambda_i - \lambda_k| < 1$. In this case, for an arbitrarily large tapped delay order
$n$,

$$\max(S_{ij}) \approx \lim_{n \to \infty} \frac{1}{\prod_{k} |\lambda_i - \lambda_k|}$$


(15) 

(16) 


For a CTDNN, $n = 2$, and it is evident that

$$S_{Nc} < S_N$$

Hence, when it matters most, i.e. for systems exhibiting high sensitivity
as in case 2, the CTDNN will provide better sensitivity properties than an
equivalent DTDNN. This effect will, in general, be greater as the order of the
tapped delay line increases.


4  Examples 

To better elucidate the difference between the networks³, we will give an example
to illustrate the possible performance differences between CTDNN and

³ We would like to point out, however, that while the problems considered here are artificially
constructed, they are meant to serve as an example of possible behaviour. For
DTDNN models. We consider a system identification problem⁴ where we have
a system to be identified $S$ and two models $M_d$ and $M_c$ which correspond to a
DTDNN and a CTDNN respectively. In order to illustrate the phenomenon, we
assume a perfect representation in each case, i.e., no training is involved. We
consider two models and the effect of coefficient quantization on the accuracy
of the models in each case.

We consider the difference between the DTDNN and a CTDNN when subjected
to finite wordlength (FWL) restrictions. In this experiment, we consider
a TDNN with 10 tap delays and 1 hidden unit. The corresponding polynomial
is given by

$$G(q) = 1.000 - 4.200 q^{-1} + 8.280 q^{-2} - 10.524 q^{-3} + 10.001 q^{-4} - 8.090 q^{-5} + 5.629 q^{-6} - 3.212 q^{-7} + 1.329 q^{-8} - 0.338 q^{-9} + 0.039 q^{-10} \qquad (18)$$

For each of the direct form and cascade form models, word lengths of 16 and
8 bits were tested. The method used to round off the weights is

$$\hat{\theta} = 2^{-w}\, \mathrm{round}(2^{w} \theta) \qquad (19)$$

where $\theta$ is the original weight, $\hat{\theta}$ is the FWL weight and $w$ is the word length
in bits.
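The rounding rule can be applied to the direct form coefficients with a few lines of code. The sketch below assumes that the rule rounds to the nearest multiple of 2^-w and uses Python's built-in `round`, which sends exact ties to the nearest even integer; the coefficient list is the polynomial given above, and the variable names are illustrative.

```python
def quantize(theta, w):
    """Round theta to the nearest multiple of 2**-w.

    Note: Python's round() sends exact ties to the nearest even integer.
    """
    scale = 2 ** w
    return round(scale * theta) / scale

# Direct form coefficients of G(q), quantized with the two tested word lengths.
G = [1.000, -4.200, 8.280, -10.524, 10.001, -8.090,
     5.629, -3.212, 1.329, -0.338, 0.039]
G_16 = [quantize(b, 16) for b in G]
G_8 = [quantize(b, 8) for b in G]
```

Each quantized coefficient differs from the original by at most 2^-(w+1). The point of the experiment is that such small coefficient errors can still move the zeros of a high order direct form polynomial noticeably.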

The behaviour of the models is compared by examining the shift in the
position of the zeros (roots) of the FIR filter polynomials at the input of the
TDNNs and comparing them to the position of the true zeros. These zeros are
plotted in Fig. 1, where it can be observed that the conventional direct form TDNN
shows significant movement in the zeros. The cascade form model, however,
performs very well and does not appear to suffer significantly from the reduced
precision in the weights.

Note that the zeros are not shifted by much in the regions where the sensitivity
is low. The regions of high sensitivity (as expected from the sensitivity
analysis; see (9) and Theorem 1) are where the zeros have large magnitudes
and are close together, i.e. for real valued systems, this is near the (1,0) point.

It has also been observed in other experiments (not shown here due to lack of
space) that the direct form model sometimes obtains nonminimum phase zeros.
For  system  identification  and  control  applications  this  can  cause  problems  [11]. 

some  applications,  there  may  be  negligible  difference  between  the  architectures,  however  we 
propose  that  the  choice  is  up  to  the  user.  If  one  wishes  to  avoid  the  potential  problems 
indicated  in  this  paper,  then  we  propose  that  the  cascade  form  or  parallel  form  TDNNs  are 
preferable  to  the  direct  form  TDNN.  If  one  is  confident  that  the  application  will  not  present 
a  problem  for  direct  form  structures,  then  the  direct  form  may  be  used.  It  is  noted  that 
except  in  speech  synthesis,  direct  forms  are  infrequently  used  other  than  in  second  order 
sections  [13].  This  is  possible  because  in  speech  synthesis  the  poles  of  the  system  function 
are widely separated [14].

⁴ Due to lack of space, only one experiment will be shown in this section; however, other
experiments indicate the same type of problem.

Clearly, even in simple TDNNs, the accuracy degrades rapidly when using
a smaller number of bits in the weights. For real-world applications, e.g.
automotive or other consumer-oriented products, this issue could be of concern
where longer wordlengths add undesirable cost to the product. On the other
hand, in areas where higher precision computing resources are available, the
problem may be viewed in a slightly different manner. Here, for any given
wordlength, when modelling low frequency systems, the cascade TDNN offers
a potentially higher accuracy than the direct form structure.

It should of course be noted that the use of cascade form filters is widespread
in digital signal processing, and as such, the neural network structure
proposed  here  should  come  as  no  surprise  to  those  familiar  with  such  methods. 
However,  in  view  of  the  widespread  usage  of  the  classical  TDNN  (direct  form) 
structure,  we  felt  that  it  may  be  worth  drawing  to  the  attention  of  the  neural 
network  community,  the  potential  advantages  of  some  slight  modifications  to 
the structure of the TDNN. Moreover, the advantages of cascade form structures
would also carry over to multilayer structures. In related work, we have
shown  that  alternative  discrete-time  operators  can  be  used  to  produce  lower 
sensitivity  structures  in  both  feedforward  and  recurrent  networks  [5,6]. 

We do not consider the learning problem in this paper; however, based on the
numerical improvements to the architecture, we would expect that there should
also be enhanced performance in on-line learning. This would apply particularly,
of course, in FWL modelling of high sensitivity systems. In any training
algorithms, the advantages would accrue from using the cascade structure directly,
as opposed to computing the polynomial weights on a highly accurate
computer, transforming them and then ‘downloading’ them to the cascade form
model, although such an approach could be used if required for some special
purpose.


5  Conclusions 

Current  neural  network  models  which  process  spatial  or  temporal  data  typically 
use  direct  form  substructures.  In  this  paper  we  have  proposed  a  new  class 
of  model  which  generalizes  the  classical  TDNN  structure  to  allow  an  input 
structure  which  consists  of  synapses  such  as  cascade,  parallel  or  cascade-parallel 
filters  instead  of  the  conventional  direct  form  structure. 

The  advantage  of  this  approach  is  particularly  evident  in  terms  of  reducing 
sensitivity  to  quantization  errors  in  the  network  weights.  The  resulting  network 
will  have  a  higher  accuracy  than  the  conventional  direct  form  network  for  a 
given  finite  word  length  implementation.  This  is  most  evident  when  the  system 
being modelled has low frequency characteristics.

[Figure 1 panels: (a) Direct Form 16 bit, (b) Direct Form 8 bit, (c) Cascade Form 16 bit, (d) Cascade Form 8 bit; horizontal axes run from -1 to 1.]

Figure 1: The zeros for the polynomials are shown for (a) 16 bit direct form, (b)
8 bit direct form, (c) 16 bit cascade form, (d) 8 bit cascade form models. In each
case, the true zeros are shown for reference. To distinguish the true zeros and the
model zeros, ’x’ symbols are used to indicate the model zeros. Here it can be noted
that the cascade model is almost identical to the true system, while the direct form
is significantly different. The cascade model gives significant advantages over the
direct form model. The main differences occur in the low frequency region, however,
indicating problems in modelling low frequency data.

Abstract

Projection Pursuit [7] [10] (PP) techniques are used to search for
statistically interesting low-dimensional projections of complex,
high-dimensional data. These projections reveal data structure useful for
automatic classification applications. We derive a novel class of Projection
Pursuit algorithms, comparing them with related PP algorithms [7] [8]
[11] [2] [1]. Texture-based cloud detection in Airborne Visible/Infrared
Imaging Spectrometer (AVIRIS) imagery from the Jet Propulsion
Laboratory is provided as a basis for inter-comparison.


1  Projection  Pursuit:  Background 

In  order  to  identify  potentially  meaningful  data  structures  in  high-dimensional 
data  sets,  “Projection  Pursuit”  [7]  (PP)  techniques  are  used  to  search  for 
statistically  interesting  low-dimensional  projections.  PP  is  an  iterative  search 
technique  that  converges  to  extrema  of  a  projection  index  or  cost  function, 

*This work was supported in part by a grant of High Performance Computing (HPC)
time from the following Department of Defense HPC Centers: Army Research Laboratory
SGI Power Challenge Array, the Maui High Performance Computing Center, and the Naval
Research Laboratory SGI Origin 2000.

that measures the degree of multi-modality or departure from normality of
the  projected  data  distribution.  One  of  the  first  PP  methods  was  proposed 
by  Friedman  and  Tukey  [7],  who  coined  the  term  “Projection  Pursuit.”  Their 
cost  function  was  the  product  of  two  functions,  one  that  measures  the  spread 
of  the  projected  data  (a  trimmed  variance  to  ensure  insensitivity  to  outliers), 
and  another  that  measures  compactness  within  a  particular  distance  scale. 
Historically,  the  Friedman-Tukey  cost  function  can  be  seen  as  an  innovative 
step  beyond  Principal  Component  Analysis  (PCA).  As  shown  in  Figure  1, 
using  PCA  to  find  maximal  data  variance  of  all  data  samples  is  not  necessarily 
the  most  informative  for  classification.  This  Figure  also  emphasizes  the  fact 
that  an  orthonormal  decomposition  does  not  always  reveal  the  most  relevant 
information  from  the  perspective  of  class  separation. 


[Figure 1 panels: "Projection Pursuit vs. Principal Components" (projection distributions along the PP vector, PC vector 1 and PC vector 2) and "Non-orthogonal Projections May Best Separate Data".]

Figure 1: (Left) PP vs. PCA: PCA corresponds to a special case of PP in which the
Projection Index is maximal variance (power), which will not always reveal clusters in
the data. More sophisticated Projection Indices can be defined to favor the discovery
of multi-modal projected distributions. (Right) Three classes of data located in distinct
clusters optimally separated by hyperplanes (dashed lines) perpendicular to the
projection vectors; hyperplanes correspond to scalar thresholds of projected data; note that
optimal PP vectors are not orthogonal.


For a unit projection vector $\hat{w}_k$, input data sample $f_i$, and projection:

$$c_k(i) = \hat{w}_k \cdot f_i, \tag{1}$$

the Friedman and Tukey PP algorithm is:

$$\text{Maximize: } I(c_k) = S(c_k)\,N(c_k) \tag{2}$$

where:

$$S(c_k) = \sqrt{\frac{\sum_{i=nN_p}^{(1-n)N_p} \left(c_k(i) - E(c_k)\right)^2}{(1-2n)N_p}} \tag{3}$$

$$N(c_k) = \sum_{i=1}^{N_p}\sum_{j=1}^{N_p} g(r_k(i,j))\,\Theta(R - r_k(i,j)) \tag{4}$$

$$\text{with } r_k(i,j) = |c_k(i) - c_k(j)|, \tag{5}$$

$$g[r_k(i,j)]: \text{monotone decreasing}, \quad \Theta: \text{a step function} \tag{6}$$

$N_p$ is the total number of patterns, $n$ is a small fraction of outliers removed at
both extremes of the projection, $E(c_k)$ is the mean projection value, and $R$ is a
scalar cut-off outside of which pairs of points are excluded in the compactness
function, $N(c_k)$. The resulting search algorithm favors the discovery of
multi-modal structure in the data. In the absence of the factor $N(c_k)$, the original
PP index would reduce to $S(c_k)$ and would be equivalent to PCA. In [7], the
search procedure consisted of optimizing 1D or 2D projections of the data,
one at a time. A number of other PP methods were later developed. A
fundamental concept in these later algorithms is to find projections that are
least normal [8] [10]. Typically, the projection vectors in these methods are
optimized serially with each subsequent search vector optimized on a residual
after subtraction of structure from the previous projection. In this sense,
they differ from unsupervised neural network learning algorithms that also
implement a form of PP but do so by jointly optimizing a set of projections
(see e.g. [11]). No proof exists to suggest whether serial or joint optimization
is superior.
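As a rough illustration of the Friedman and Tukey index described above, the sketch below evaluates the product of a trimmed spread and a pairwise compactness term for a list of 1D projected values. Several details are assumptions made for the example and are not taken from the text: the mean is computed over the trimmed sample, the monotone decreasing function is chosen as g(r) = R - r, and the function names are ours.

```python
import math


def trimmed_spread(c, n):
    """Trimmed standard deviation: drop a fraction n of points at each extreme."""
    ordered = sorted(c)
    drop = int(n * len(ordered))
    kept = ordered[drop:len(ordered) - drop]
    mean = sum(kept) / len(kept)
    return math.sqrt(sum((x - mean) ** 2 for x in kept) / len(kept))


def local_density(c, R):
    """Compactness term: sum g(r) over all pairs closer than the cut-off R."""
    total = 0.0
    for ci in c:
        for cj in c:
            r = abs(ci - cj)
            if r < R:               # step function: pairs beyond R are excluded
                total += R - r      # assumed monotone decreasing g(r) = R - r
    return total


def friedman_tukey_index(c, n, R):
    """Projection index: spread of the projection times its local compactness."""
    return trimmed_spread(c, n) * local_density(c, R)
```

A search procedure would evaluate this index for candidate projection vectors and keep the direction that maximizes it; that outer optimization is omitted here.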

2  A  Novel  Class  of  Projection  Pursuit  Indices 

In revisiting the original PP Index designed by Friedman and Tukey, one can
see a limitation: the factor measuring the degree of data spread, $S(c_k)$, does
not directly focus on the spread between clusters but rather measures the
spread of the whole data set. One approach to circumventing this difficulty is
to define an entirely new projection index that focuses directly on the degree
of departure from normality (see e.g. [8]). Alternatively, our approach is
to replace $S(c_k)$ with a function, $D(c_k)$, that directly measures the spread
outside a clustering/nearness scale $a_k$. The symbol $c_k^{(n)}$ refers to the nonlinear
data projections that replace the linear projections $c_k$ defined in Equation 1:

[Equation (7), defining the nonlinear projection $c_k^{(n)}$, is not legible in the source.]

where

$$\sigma(x) = a \tanh(a \lambda x), \qquad (a, \lambda \text{ constant}), \tag{8}$$

[Equation (9) is not legible in the source.]

and where $w_j^{(n)}$ is the jth modifiable projection vector, which weights the
inputs $c$ from layer $n - 1$, together with a bias term. The coupling/constraint
matrix $L^{(n)}$ is either fixed or modifiable (see the next Section). The nonlinear
projections are no longer constrained to be on the unit sphere, and, importantly,
are expressed as saturating nonlinearities that remove sensitivity to
extreme outliers in the data. In fact, this latter property allows us to retain
data points originally ignored by the Friedman-Tukey Index (see (3) and (4)).

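We optimize projections jointly as in PP neural network approaches. Our
new PP Index is:

[Equation (10) is not legible in the source.]

where $N^{cont}(c_k^{(n)})$ and $D^{cont}(c_k^{(n)})$ are given by:¹

$$N^{cont}(c_k^{(n)}) = E_{pairs}\left[\, g(r_k(\mu,\nu)) \,\right] \tag{11}$$

$$D^{cont}(c_k^{(n)}) = E_{pairs}\left[\, r_k^2(\mu,\nu)\,\bigl(1 - g(r_k(\mu,\nu))\bigr) \,\right] \tag{12}$$

$$\text{with } r_k^2(\mu,\nu) = \bigl(c_k^{(n)}(\mu) - c_k^{(n)}(\nu)\bigr)^2 \quad \& \quad g(r_k(\mu,\nu)) = e^{-r_k^2(\mu,\nu)/a_k^2} \tag{13}$$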

We have chosen a particular form for the nearness function $g(r_k(\mu,\nu))$ in
order to derive specific expressions for a particular case of Equation 10. In
our novel PP algorithm, each nearness function $g(r_k(\mu,\nu))$ has a clustering
scale factor $a_k$ associated with it. Each $a_k$ is obtained by multiplying an
initial estimate of the standard deviation of the projected data by a random
number drawn from a user-determined range. The $a_k$ can be modifiable,
although for the results in this paper, they were static. Selecting a range
of $a_k$ is useful because clusters and other structure may be visible on more
than one scale in the data depending on how the high-dimensional data is
viewed, i.e. depending on the orientation of the PP search vector in the
high-dimensional data space.

In (12), we use the squared distance weighted by the factor
$(1 - g(r_k(\mu,\nu)))$, since this product will directly measure inter-point spread of
projected data, weighting more heavily those distances outside the
nearness/clustering scale $a_k$. Thus, it will be maximal for well-separated clusters
existing within the scale $a_k$. If the projected distribution is multi-modal,
$D^{cont}(c_k^{(n)})$ will measure the spread of the modes better than $S(c_k)$, which
simply measures overall data spread by calculating the variance about the
mean of the projected distribution.

¹ Here $E_{pairs}$ means expected value over sample pairs.
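The pairwise spread and compactness terms discussed above can be sketched as plain averages over all distinct sample pairs of a 1D projection. This is an illustration only: the Gaussian nearness function exp(-r^2 / a_k^2) is an assumed form, and the function names are hypothetical rather than the authors'.

```python
import math
from itertools import combinations


def nearness(r_sq, a_k):
    """Assumed Gaussian nearness function of the squared distance r_sq."""
    return math.exp(-r_sq / a_k ** 2)


def spread_and_compactness(c, a_k):
    """Return (D, N): pair-averaged weighted spread and nearness of projections c."""
    d_sum = 0.0
    n_sum = 0.0
    pairs = 0
    for c_mu, c_nu in combinations(c, 2):
        r_sq = (c_mu - c_nu) ** 2
        g = nearness(r_sq, a_k)
        d_sum += r_sq * (1.0 - g)   # squared distance, weighted towards far pairs
        n_sum += g                  # close pairs contribute most to compactness
        pairs += 1
    return d_sum / pairs, n_sum / pairs
```

For two tight clusters that are far apart relative to the scale, the spread term is dominated by the between-cluster pairs while the compactness term is dominated by the within-cluster pairs, which is the behaviour the paragraph above describes.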

Minimization of our PP Index in (10) by gradient descent leads to the
following modification equation for the ith projection vector $w_i^{(n)}$:

[Equations (14)-(17), which give the weight update and the derivatives of $D^{cont}$ and of $r_k^2(\mu,\nu)$ with respect to the weights, are not legible in the source; the last expression, (18), reads:]

$$\lambda\,(a - c^{(n)})(a + c^{(n)}). \tag{18}$$

2.1  Implementation  Details 

Because of the pairwise calculations inherent in quantities such as $D^{cont}$ and
$N^{cont}$, memory and computational requirements in a storage intensive
version of the algorithm would be potentially quite severe, growing quadratically
with the number of sample estimates used at each update. This potential
problem is circumvented by an on-line implementation that uses stochastic
gradient descent, with estimates of these quantities computed by a
running average. For example:

$$N^{cont}(c_k)(t) = \epsilon\, N^{cont}(c_k)(t-1) + (1-\epsilon)\, g\bigl(r_k^2(c_k(t), c_k(t-1))\bigr) \tag{19}$$
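The running-average estimate can be sketched as a one-line update that blends the previous estimate with the nearness of the newest pair of consecutive projected samples, so that no pairwise table has to be stored. The Gaussian form of the nearness function and all names below are assumptions made for illustration.

```python
import math


def update_running_nearness(n_prev, c_t, c_prev, a_k, eps):
    """One running-average step using the pair (current sample, previous sample)."""
    r_sq = (c_t - c_prev) ** 2
    g = math.exp(-r_sq / a_k ** 2)    # assumed Gaussian nearness function
    return eps * n_prev + (1.0 - eps) * g


def running_nearness(stream, a_k, eps, n_init=0.0):
    """Fold the update over a sequence of projected values; memory use is O(1)."""
    estimate = n_init
    for c_prev, c_t in zip(stream, stream[1:]):
        estimate = update_running_nearness(estimate, c_t, c_prev, a_k, eps)
    return estimate
```

A value of eps close to 1 gives a long memory and a smooth estimate; a smaller value tracks changes in the projection more quickly at the cost of noisier estimates.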
By choosing a small and decreasing learning rate, $\eta(t)$, and an appropriate
value of $\epsilon$, the distribution of sample pairs can be estimated sufficiently
to ensure gradient descent. Typically, we let the learning rate decrease as
$\eta(t) = \eta_0/(1 + \ln(t))$. A theoretical justification for choosing a logarithmically
decreasing learning rate can be made using arguments from simulated
annealing [9].
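A logarithmically decreasing schedule of this kind can be sketched as follows. The printed formula is partly illegible, so the exact form used here, eta0 / (1 + ln t) with the step counter t starting at 1, is an assumption, as is the function name.

```python
import math


def learning_rate(t, eta0):
    """Logarithmically decreasing rate eta0 / (1 + ln t), defined for t >= 1."""
    if t < 1:
        raise ValueError("t must be at least 1")
    return eta0 / (1.0 + math.log(t))
```

Such a schedule decays far more slowly than 1/t, which keeps the updates large enough to leave shallow local minima for a long time.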

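Another detail of the implementation is that the coupling matrix $L^{(n)}$
may be fixed or modifiable. In the latter case, $L^{(n)}$ is modified by gradient
ascent to maximize the relative entropy of the projections, $c$:²

[Equations (20)-(23) are not legible in the source. The surviving fragments indicate that (20) sets the initial coupling matrix $L^{(n)}(0)$, with diagonal entries $1 - \mu$ for $i = j$, that (21) defines the relative entropy $\zeta$, and that (22) and (23) give the corresponding update of $L^{(n)}$.]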

If the scale of the initial coupling matrix, set by $\mu$, is sufficiently larger than
the scale used to randomly initialize the weights $w_{ij}$, and the learning rates
for $L$ and $w$ are chosen appropriately, then the change in the effective projection
vectors, $\delta w_{eff} = \delta(L \cdot w) = L\,\delta w + (\delta L)\,w = L\,\eta_w \nabla_w H + \eta_L (\nabla_L \zeta)\,w$, will
be dominated by the PP gradient term with adjustments from the second
term dependent on $\nabla_L \zeta$. Our experiments to date suggest that using the
modifiable coupling matrix rather than the fixed constraint may accelerate
the maximization of the PP Index $H$. At present we use both forms in
experiments, although results reported here were obtained with the modifiable
constraint matrix.

2.2  Results  for  a  Remote  Sensing  Application 

To  investigate  the  potential  usefulness  of  PP  techniques  for  the  extraction  of 
textural features in future Multi-Angle Imaging Spectro-Radiometer (MISR) [6]
data, we have analyzed 17 high resolution images from the Airborne Visible/Infrared
Imaging Spectrometer (AVIRIS) [13] operated by the Jet Propulsion
Laboratory. We use the four (out of 224) AVIRIS channels centered at
the  MISR  wavelengths  of  443,  555,  670,  and  865  nm  [6].  An  example  of 

² $E_p()$ refers to the expected value across projections in the network.

Figure 2: Two-dimensional histogram of data projected onto a pair of projections in our new PP network
after training; each axis is the degree of overlap with one of the selected projections. Inputs were gray-level
difference vector (GLDV) histograms [5] [15] derived from AVIRIS band 5; input windows were 12x12 pixels.


multi-modal  structure  found  in  the  AVIRIS  data  by  our  new  PP  algorithm 
is  shown  in  Figure  2.  A  typical  end  result  for  one  of  the  novel  AVIRIS  test 
images  using  related  PP  techniques  [1]  [2]  from  our  earlier  research  along 
with  standard  texture  features  [14]  is  illustrated  in  Figure  3.  As  Figure  3 
demonstrates,  the  majority  of  incorrect  identifications  are  at  the  edges  of  the 
clouds  where  the  features  are  a  mixture  of  cloudy  and  cloud-free  imagery.  In 
Figure 4, results obtained with our new PP algorithm combined with cross-entropy
based backward propagation (BPCE) [12] as a back-end classifier are
compared  against  these  other  approaches.  For  comparable  architecture  sizes, 
the  new  PP  algorithm  with  BPCE  appears  to  be  better  than  the  other  PP 
algorithms  individually  combined  with  BPCE.  Also,  the  new  PP  algorithm 
with  BPCE  is  statistically  closest  to  the  “all”  cases  that  achieved  the  best 
performance  by  combining  features  derived  by  all  of  the  other  PP  algorithms 
along  with  standard  statistical  features  [14].  Mean  performance  on  the  novel 
held-out  image  for  the  best  “all”  case  is  (93.5  ±  6.8)  %  cloud  pixels  detected 
with false alarm rate of (10.6 ± 10.0) %. The size of the error bars in Figure 4
is somewhat large due to several factors: the limited number of trials (eight
to ten per paradigm), the fact that there is a high degree of variability within
even this limited database, and the complexity of the search space, which is
full of local minima. Also, this database is significantly smaller than is
necessary to completely characterize class variability and adequately represent
the cloud detection problem domain in the training set. A larger statistical
sample  of  experiments  and  further  tuning  of  the  individual  PP  algorithms  is 
needed to obtain a better estimate of the relative merits of the approaches,
but these first results are encouraging and suggest that features from the new
PP algorithm might be more robust than those found by the other PP algorithms
and should be added to the “all” case to look for further improvement
in future experiments.

Figure 3: Cloud detection in novel AVIRIS test scene. (Upper left) Near infra-red band (865nm) for
an AVIRIS scene in the test set; band is one of 4 chosen to correspond spectrally to future MISR bands;
colormap is black (low) to white (high). (Upper right) Human interpretation of cloud pixel location, white
= cloud, black = no cloud. (Lower left) Cloud detection result from an ensemble network using extracted
Wavelet Projection Pursuit (WPP) features [2], BCM-PP [4] [11] [1] features from GLDV histograms,
BCM-PP features from simply normalized pixel intensities [1], and standard statistical moments [14] from
GLDV; features in the ensemble model were extracted from all four spectral channels; white = cloud, black =
no cloud. (Lower right) Difference mask: black = no error, gray = false alarm, white = false negative; most
errors are on cloud/no-cloud boundaries. Cloud detection rate for this novel test image was 94.0 % with false
alarm rate 4.8 %.

Abstract

Feature extraction is an important preliminary step to classification of
complex signals. By reducing a high-dimensional signal to a lower-dimensional
feature set which preserves the relevant structure of the signal,
classification performance is enhanced. A classification system was developed
to classify sonar signals as to whether the object detected is minelike or
non-minelike. Results are presented comparing classification performance when
various feature extraction methods are implemented.


INTRODUCTION 

Our  objective  is  to  extract  pertinent  features  from  acoustic  signals  in  order  to 
develop  an  automated  classifier  for  sonar  signals.  The  specific  application  in  this 
case  is  classification  of  underwater  mines,  but  this  technique  could  be  readily 
applied to other types of sonar signals. We have used sonar images of the sea
bottom which contain sonar returns generated by mines of various types. The
purpose of the classifier is to confirm the presence of a mine, and then to
ultimately determine what type of mine is present. The emphasis here is not the
classifier itself, but the process of feature extraction which produces a projection
of  the  signal  in  a  lower  dimensional  feature  space,  thus  improving  classifier 
performance. 

The results that will be presented in this paper show successful implementation
of feature extraction algorithms in a system for sonar signal classification. Results
are described for a two-class problem (mine and rocks) and a four-class problem (three
mine types and rocks), with a comparison of results using several methods of preprocessing.


U.S.  Government  work  not  protected 
by  U.S.  copyright 


FEATURE  EXTRACTION 


In many of the classification systems currently used, the process of feature
extraction is relatively ignored in comparison to the method of classification itself.
This is due to the fact that in many classification techniques, neural networks in
particular, the process of feature extraction is inherently embedded in the
classification technique rather than being apparent as a separate process. If a
multi-layer  neural  network  is  used  to  classify  unprocessed  data,  the  input  layer, 
which  learns  from  examples,  will  essentially  become  a  feature  extractor.  Also,  the 
increased  availability  of  computational  power  enables  complex  data  to  be 
classified  with  less  preprocessing. 

However,  in  problems  such  as  machine  vision,  speech  identification  or  the 
problem  posed  here  of  detecting  objects  in  sonar  returns,  the  input  dimensionality 
of  the  problem  becomes  an  impediment  to  classification.  Even  neural  networks 
are limited by the problem of parameter estimation: as the number of parameters
increases, the size of the data required to train the network increases faster. For a
complex  problem,  obtaining  the  necessary  data  may  be  expensive  or  even 
impossible. 

Much has been written about this dilemma, known commonly as the curse of
dimensionality. Because of the inherent sparsity of high-dimensional spaces, an
inordinate  amount  of  training  data  is  required  to  obtain  reasonably  low  variance 
estimators.  The  objective  of  feature  extraction  is  to  reduce  the  dimensionality  of 
the  data  before  performing  classification.  This  is  based  on  the  assumption  that  the 
important  structure  in  the  data  actually  lies  in  a  much  lower  dimensional  space.  If 
the  transformation  from  the  high  dimensional  space  to  the  low  dimensional  space 
is  accomplished  without  losing  much  relevant  information,  classification 
performance  will  be  enhanced. 

Thus  the  optimal  method  of  classification  of  high  dimensional  data  is  a  two 
step  process:  feature  extraction  which  projects  the  original  data  to  a  lower 
dimensional  feature  space;  and  then  performing  classification  in  the  low 
dimensional  space.  An  important  consideration  is  that  the  features  obtained  must 
be  independent  of  class  membership,  because  at  this  point  in  the  process  the  class 
to  which  the  data  belongs  is  not  yet  known. 


DESIGN  OF  THE  FEATURE  EXTRACTION/CLASSIFICATION  SYSTEM 

An overall scheme for preprocessing a given acoustic signal (each sonar return
consisting of either a minelike object or a non-minelike object such as a rock),
reducing it to a set of features, and using the feature set as an input to an
automated classifier was developed and implemented. Various types of learning
algorithms  and  signal  processing  techniques  were  used  to  perform  the 
preprocessing.  These  methods  include  Bienenstock,  Cooper  and  Munro  (BCM) 
learning  theory,  principal  components  analysis,  and  several  types  of  transforms 
(Karhunen-Loeve,  FFT,  and  wavelets).  Results  for  some  of  these  techniques  are 
described  here;  other  results  will  be  available  shortly  and  will  be  included  in  the 
final  paper. 


THE  BCM  ALGORITHM 

Modification of synaptic effectiveness between cortical neurons is widely
believed to be the physiological basis of learning and memory, and there is
evidence of similar synaptic plasticity in many areas of mammalian cortex. In
1982, Bienenstock, Cooper and Munro [2] proposed a concrete synaptic
modification hypothesis in which two regions of modification (Hebbian and
anti-Hebbian) were stabilized by the addition of a sliding modification threshold. The
BCM  theory  was  originally  created  to  explain  the  development  of  orientation 
selectivity  and  binocular  response  in  various  visual  environments  in  kitten  striate 
cortex,  one  of  the  most  thoroughly  studied  areas  in  neuroscience. 

In the BCM model, a change in the weight of a synapse is equal to the product
of the presynaptic activity and a certain function ($\phi$) of the postsynaptic activity.
The qualitative consequences of the theory follow from a few properties of the $\phi$
function:

-when the postsynaptic activity is above the modification threshold ($\Theta_m$), $\phi$ is
positive (i.e. an active synapse will be potentiated);

-when postsynaptic activity is below $\Theta_m$, $\phi$ is negative (i.e. an active synapse
will be depressed);

-there is no modification ($\phi = 0$) when postsynaptic activity is equal either to the
spontaneous firing rate or to $\Theta_m$;

-the value of $\Theta_m$ is not fixed, but rather "slides" as a function of the
postsynaptic activity.
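The properties listed above can be sketched in code for a single neuron. The text specifies only the qualitative behaviour of the modification function, so the quadratic form c(c - threshold), the running average of the squared activity used to slide the threshold, a spontaneous firing rate of zero, and all function and parameter names are assumptions made for illustration.

```python
def phi(c, theta_m):
    """Assumed quadratic form: negative below the threshold, positive above it."""
    return c * (c - theta_m)


def bcm_step(w, d, theta_m, eta, tau):
    """One weight and threshold update of a single BCM neuron for input vector d."""
    # Postsynaptic activity is the projection of the input onto the weights.
    c = sum(wi * di for wi, di in zip(w, d))
    # Weight change = presynaptic activity times phi of the postsynaptic activity.
    w_new = [wi + eta * phi(c, theta_m) * di for wi, di in zip(w, d)]
    # The threshold slides towards the recent squared postsynaptic activity.
    theta_new = theta_m + (c * c - theta_m) / tau
    return w_new, theta_new
```

Inputs with a zero component leave the corresponding weight unchanged, which matches the statement that only active synapses are potentiated or depressed.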

It  is  the  selectivity  of  BCM  neurons  that  makes  this  approach  so  suitable  to  the 
problem  of  feature  extraction.  In  a  BCM  network,  a  small  number  (say,  six)  of 
BCM  neurons  are  connected,  with  lateral  inhibition  introduced  into  the 
connections. In the architecture used here, the kth BCM neuron is inhibited by the
1st through (k-1)th neurons. Thus, a small network of BCM neurons is capable of
providing a set of distinct features from an input signal. The usual approach is to
train the neurons on a combined set of samples of each type of data (i.e., in an
n-class classification problem, train on all n classes together). The class separability
characteristic  inherent  in  BCM  neurons  yields  a  set  of  features  which  tends  to  be 
very  effective  in  identification  of  the  distinct  classes. 

Previous research presented by the author [3] has demonstrated the capability
of the BCM learning rule to perform feature extraction from acoustic signals. In
that study, both single BCM neurons and networks of laterally-inhibited BCM
neurons were trained on samples of marine mammal sounds. The result of this
process is a weight vector (or set of vectors in the multiple neuron case);
convolving it with a signal of interest constitutes “testing”. The output of the
convolution process can then be input into an automated classifier (feed-forward
neural network, nearest neighbor, etc.). The results obtained by testing on two
types of marine mammals (porpoise and dolphin) demonstrated the capability of
BCM neurons to become distinctly selective to one type of signal.


MODEL  DESCRIPTION 

The  design  of  the  BCM-based  classifier  is  depicted  in  Figure  1  (the  overall 
schematic  for  classifier  systems  employing  other  preprocessing  techniques  is 
essentially  the  same).  A  network  of  laterally-inhibited  BCM  neurons  is 
constructed:  the  number  varies  depending  on  the  problem,  but  in  the  work 
described  here  6  neurons  were  used.  A  parameter  that  adjusts  the  degree  of 
inhibition  between  neurons  is  set;  the  optimal  degree  of  inhibition  is  generally 
determined  by  performing  several  training  runs  and  observing  how  well  the  value 
of $\Theta_m$ converges as this parameter is adjusted.

Examples  of  data  from  each  class  are  used  to  train  the  BCM  network.  As 
discussed previously, this is an unsupervised procedure: the classes of the data
must be assumed unknown at this point. Training the neurons on only one type of
data will result in features that depend on the class of data used for
training; they will therefore not be effective in achieving separability between classes.
Thus,  if  we  are  attempting  to  ascertain  if  a  sonar  return  is  of  a  mine  or  a  rock,  we 
must  train  the  BCM  network  on  a  combined  set  of  mine  and  rock  data. 

The result of this process is a set of BCM weight vectors. The number of vectors
equals the number of BCM neurons in the network; the length of each vector is
predetermined, as appropriate for the type of signal. When using acoustic data,
the procedure is to sample each set of data by randomly running a window over it
(typically of length 512 or 1024) and obtaining a vector of that length.

Figure 1. Schematic of BCM-based classifier system

Next,  the  dot  product  of  the  signal  to  be  tested  and  the  set  of  BCM  weight 
vectors  is  computed.  The  resultant  vectors  comprise  the  feature  set  to  be  used  for 
classification.  In  all  of  the  experiments  described  in  this  paper,  a  feed-forward 
neural  network,  using  a  standard  backpropagation  algorithm,  was  used.  The 
intention here is not to focus on the classifier itself, but on the process of
constructing the inputs to the classifier. To train the classifier, the BCM
network is first tested with data of known classification; the same procedure is
then applied, with the trained classifier, to determine the classification of an
unknown sample.
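
The windowing and projection steps described above can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names `sample_windows` and `bcm_features`, the random stand-in weight vectors, and the window length of 8 (instead of 512 or 1024) are assumptions made for the example, and the BCM training that would produce the weight vectors is not shown.

```python
import random

def sample_windows(signal, window_len, n_windows, rng):
    """Draw n_windows segments of length window_len at random offsets."""
    max_start = len(signal) - window_len
    starts = [rng.randint(0, max_start) for _ in range(n_windows)]
    return [signal[s:s + window_len] for s in starts]

def bcm_features(window, weight_vectors):
    """Project one window onto each weight vector (one dot product per neuron)."""
    return [sum(w_k * x_k for w_k, x_k in zip(w, window)) for w in weight_vectors]

rng = random.Random(0)
# Stand-in for a sonar return (hypothetical data).
signal = [rng.gauss(0.0, 1.0) for _ in range(64)]
# Six hypothetical "trained" weight vectors, one per neuron.
weights = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(6)]

windows = sample_windows(signal, window_len=8, n_windows=5, rng=rng)
features = [bcm_features(w, weights) for w in windows]  # 5 feature vectors of length 6
```

Each feature vector has one entry per BCM neuron and would be passed on to the classifier.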


EXPERIMENTAL  RESULTS 

The  above  described  procedure  was  performed  with  various  sets  of  real  data 
(obtained  from  surface  ship  sonars)  and  sets  of  synthetic  data.  In  order  to  provide 
a benchmark for classifier performance, many of the tests were repeated using
wavelet or FFT preprocessors as well as BCM feature extraction, all with
classifiers of comparable architecture. In the comparison studies using
these three techniques, BCM consistently performed as well as or better
than wavelet preprocessing; both methods were consistently better than FFT
preprocessing.

Figure  2  shows  the  results  of  testing  on  a  set  of  data  obtained  by  using  sets  of 
sonar returns from minelike and non-minelike objects that were denoised as much
as possible; then noise was incrementally introduced to the signals to produce a
set  of  sonar  returns  of  varying  SNR.  The  classification  results  using  BCM, 
wavelet,  and  FFT  preprocessing  were  plotted  versus  the  SNR  for  the  various 
signals.  In  this  example,  BCM  produced  the  best  results  in  the  mid-range  of  the 
curve.  While  the  actual  SNR  values  here  might  not  be  the  same  as  a  “real”  mine 
classification problem, it is the middle of the curve that is most significant. When
the SNR is close to zero, no classifier is going to do much better than a random
guess. When the SNR is high, any classifier will have a good chance of
performing  well,  but  real  sensors  are  generally  not  able  to  achieve  this  level  of 
SNR,  particularly  in  the  shallow  water  environments  of  interest  to  the  Navy. 
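
One way to produce returns of varying SNR from a denoised signal is sketched below. The paper does not state the noise model or how the SNR was measured; white Gaussian noise and an SNR defined as the ratio of mean signal power to mean noise power (in dB) are assumptions here, as are the function name `add_noise_at_snr` and the sinusoidal test signal.

```python
import math
import random

def mean_power(x):
    return sum(v * v for v in x) / len(x)

def add_noise_at_snr(signal, snr_db, rng):
    """Return signal plus white Gaussian noise scaled to the requested SNR (dB)."""
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    target_noise_power = mean_power(signal) / (10.0 ** (snr_db / 10.0))
    # Rescale the drawn noise so that its realised power equals the target.
    scale = math.sqrt(target_noise_power / mean_power(noise))
    return [s + scale * n for s, n in zip(signal, noise)]

rng = random.Random(1)
clean = [math.sin(0.2 * t) for t in range(200)]   # hypothetical denoised signal
noisy_by_snr = {snr: add_noise_at_snr(clean, snr, rng) for snr in (0, 5, 10, 20)}
```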

Figure 2. Comparison of Percent Correct Classification of Mines vs. SNR for BCM, wavelet
and FFT preprocessing

The technique can also be applied to problems with more than two classes. In
one experiment, samples of returns were used containing three types of mines
(two bottom mines and one close-moored mine) as well as rocks. The SNR of the
signals is not considered in this problem. The confusion matrix of the
classification results, comparing BCM with FFT preprocessing, is shown in
Table 1. The percentage of correct classification when BCM was employed
ranged from 87% to 94% among the classes; when FFTs were used the results
ranged from 59% to 69% correct.
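
Per-class figures such as those quoted above are the diagonal entries of a confusion matrix divided by the row totals. The sketch below shows that computation; the counts are made-up placeholders, not the values in Table 1, and the convention that rows are true classes is an assumption.

```python
def percent_correct_per_class(confusion):
    """confusion[i][j] = number of samples of true class i assigned to class j."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(confusion)]

# Hypothetical 4-class counts (three mine types and rocks); illustrative only.
example = [
    [45, 2, 2, 1],
    [3, 44, 2, 1],
    [1, 2, 46, 1],
    [2, 1, 1, 46],
]
rates = percent_correct_per_class(example)
```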


CONCLUSIONS  AND  ONGOING  RESEARCH 

We  have  developed  and  demonstrated  an  automated  classifier  for  acoustic 
signals  using  the  BCM  learning  procedure  to  perform  feature  extraction.  The 
results  obtained  in  the  various  classification  problems  support  the  contention  that 
the  BCM  algorithm  is  an  effective  means  of  obtaining  features  for  the  purposes  of 
classification.  We  will  continue  to  run  similar  classification  problems  using  the 
various  feature  extraction  methods  to  determine  the  optimal  method,  which  might 
depend  on  the  type  of  signal  to  be  classified. 

Ongoing work is emphasizing the multiple-class problem, and comparing the
approaches of using one or two levels of classification. The multiple mine type
classifier may be posed as a two-level classification problem, in which a sonar
return from a submerged object is first classified as minelike or non-minelike,
and then minelike objects are identified as a particular type of mine.


A  particular  area  of  interest  at  present  is  determining  the  characteristics  of  the 
features  that  the  BCM  neurons  identify  in  minelike  objects.  For  example,  a  trained 
sonar  operator  determines  that  an  object  is  minelike  by  observing  several 
characteristics  (size,  shape,  reverberation,  shadow,  aspect)  in  the  sonar  image.  We 
are  presently  attempting  to  ascertain  if  the  features  derived  by  BCM  are 
analogous. 

Another  method  that  promises  to  achieve  good  classification  performance  is 
using  genetic  algorithms  to  obtain  the  best  feature  set  for  the  classifier.  Our  work 
with this approach is too preliminary to present results at the moment; they will
be presented in a follow-on paper.

Abstract. An algorithm for defining and initializing nonlinear recurrent
models using linear models is described. From a modeling point of view it is
natural to try linear models first and then continue with nonlinear models.
The suggested method gives such an algorithm, and the nonlinear recurrent
model is defined as an extension of the linear model. This gives fewer problems
with local minima compared to a random initialization. Also, the stability of
the model and of its derivative with respect to the parameters can be
guaranteed, which is a requirement for the prediction-error estimation
method (sometimes called back-propagation through time) to be applicable.

1.  INTRODUCTION 

Parameter estimation of recurrent neural networks is often described as a
tricky subject; see, e.g., [3]. The problem can be described as an iterative
criterion minimization. Depending on the initial parameter guess, the
solution converges to different local minima. With a good initial parameter
guess the chances of converging to the global minimum, or at least to a
favorable local minimum, increase. Recurrent models are especially sensitive
to the initial parameter values, and convergence to a good minimum cannot be
expected unless two filters are stable. The first filter is the predictor model
and the second one is the derivative of the prediction with respect to the model
parameters.

In this contribution we take a slightly different approach than many
other works on recurrent networks. We do not focus on a specific recurrent
network which is investigated and applied to different problems. Instead,
the approach is problem oriented, from an engineering point of view. Given
input-output data from a dynamic system (or just output data in the case of a
time series), the goal is to obtain a model which describes the system as well
as possible, and in this search all kinds of different models can be considered.


0-7803-4256-9/97/$  10.00  ©1997  IEEE 


In the search for a good model structure (i.e., network) one might want to
try some recurrent models. Then the topic of this work becomes relevant:
initialization of nonlinear recurrent models (i.e., recurrent networks).

Since  the  parameter  estimate  has  to  be  computed  by  an  iterative  search 
for  the  minimum  of  a  criterion  of  fit  there  will  usually  be  problems  with 
local  minima.  Already  with  linear  recurrent  models  one  has  such  problems 
and  a  good  initial  parameter  guess  is  important  to  assure  convergence  of  the 
iterative  search  to  the  global  minimum.  For  nonlinear  models  the  problems 
with  local  minima  will  be  even  more  serious. 

Example  1  In  Figure  1  a)  the  true  step  response  of  a  linear  second  order 
system  is  shown  together  with  the  response  of  a  model  corresponding  to  a  local 
minimum.  The  true  system  is  oscillating  and  its  complex  poles  are  close  to 
the unit circle, Figure 1 b). The model has a real pole in between the true
poles  and  its  second  pole  is  just  outside  the  unit  circle.  This  local  minimum 
corresponds  to  a  model  which  has  only  modelled  the  mean  value  of  the  output 
signal. 
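
Whether a second-order linear model is stable can be checked from its poles. The sketch below assumes the model is written as y(t) = f1*y(t-1) + f2*y(t-2) + b1*u(t-1), so that the poles are the roots of z^2 - f1*z - f2; the coefficient values are hypothetical and are not those behind Figure 1.

```python
import cmath

def poles_second_order(f1, f2):
    """Roots of z**2 - f1*z - f2 = 0."""
    disc = cmath.sqrt(f1 * f1 + 4.0 * f2)
    return (f1 + disc) / 2.0, (f1 - disc) / 2.0

def is_stable(f1, f2):
    """Stable when both poles lie strictly inside the unit circle."""
    return all(abs(p) < 1.0 for p in poles_second_order(f1, f2))

# Complex pole pair close to the unit circle (oscillating, stable).
oscillating = poles_second_order(1.6, -0.95)
# Real poles at 1.05 and 0.8: one pole just outside the unit circle.
real_pair = poles_second_order(1.85, -0.84)
```

A model of the second kind gives an unstable predictor, which is one reason a poor local minimum is a problem for the gradient computation as well as for the fit.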


Figure 1: a) Step response of a second order linear system and a model corresponding
to a local minimum. b) The two complex poles (x) of the linear system at the
unit circle and those of the model on the real axis.

By starting with a linear model and adding a nonlinearity to it, the linear
model gives a clue about how to choose the parameters in the nonlinear part.
The position of the basis functions has to be decided, and they should
be placed so that they are activated by the estimation data. Otherwise the
nonlinearity will not have any influence on the model. To make this decision
one needs a “preliminary” model. Since well developed algorithms and tools
exist for linear models, and since linear models perform well, or fairly well,
on many problems, it is natural to choose the preliminary model to be linear.
There is, however, no problem in applying the algorithm to an existing nonlinear
model to obtain a more advanced nonlinear model. In addition to the
position of the basis functions one also has to decide upon their amplitude.
By choosing it to be zero the initial nonlinear model becomes equal to the linear
model. The advantage of this is that stability of the two filters follows from the
stability of the linear model. Also, the parameters are calculated by a gradient
based minimization starting at a point where the performance equals that of
the linear model. This means that the performance of the nonlinear model
cannot become worse than that of the linear model. Hence, the algorithm
guarantees a nonlinear model which performs at least as well (on estimation
data) as the linear model.

The algorithm is quite general and can be used to obtain nonlinear extensions
to any linear model class. Recurrent models are equivalent to state-space
models, see, e.g., [7], so the algorithm presented here can be seen as a
procedure for obtaining nonlinear black-box state-space models. A subset of the
recurrent models can also be described as Output Error (OE) and ARMAX
models. Hence, the algorithm can also be used to construct and estimate
nonlinear OE and ARMAX models.

To the author's best knowledge there is very little work addressing
the initialization of recurrent models. Typically, the parameters are initialized
with small random values; the suggested method will be compared to this
random choice on a small example in Section 4.

The paper is organized as follows. A short background and a problem
definition are given in Section 2. Then the initialization algorithm and the
stability  of  the  two  nonlinear  filters  are  given  in  Section  3.  The  suggested 
method  is  applied  to  a  small  example  in  Section  4  and  the  paper  is  concluded 
in  Section  5. 

2.  PROBLEM  DESCRIPTION 

Let y(t) be the output of the process to be modeled and u(t) the input signal
which influences the process. For simplicity it will be assumed that both
y(t) and u(t) are scalars. It is straightforward to extend the results to the
multi-input and multi-output case. For time series, where no input is used, the
results follow right away by just canceling u(t) in all equations.

The goal is to find a model which uses past measurements to predict future
outputs y(t). A common black-box approach is to consider a parameterized
candidate model

$$\hat{y}(t,\theta) = g(\theta, \varphi(t,\theta)) \tag{1}$$

where $\hat{y}(t,\theta)$ is the prediction of y(t), $\theta$ is the parameter vector to be
tuned, and $\varphi(t,\theta)$ is the regressor, which contains information available at
time t. In this way the modeling has been divided into two parts, a choice of
regressor $\varphi$ and a choice of mapping g. Depending on the choice of
$\varphi(t,\theta)$ and g, different models, or neural nets, are obtained. A short
background on this is given here; see [9] for a deeper discussion.

2.1.  The  regressor 

The regressor $\varphi(t,\theta)$ is formed by a mapping from the data $y^{t-1}$ and $u^{t-1}$,
where $y^{t-1} = [y(1), \ldots, y(t-1)]$ and $u^{t-1}$ is defined analogously. If $\varphi$ depends
on $\theta$ the model is recurrent, i.e., the regressor contains information given by
the model at earlier time instants. This can be described in several different
ways, e.g., by using states

(2) 

where 

= 

and x(t) are the states of the model, i.e., all information which has to be
stored  at  each  time  step  and  then  fed  into  the  model  at  the  next  time  instant. 
The  function  g  is  now  a  function  H**"*"^  where  d  is  the  number  of 

states.  The  description  (2)  includes  all  recurrent  models  like,  e.g.,  Elman  and 
Jordan  networks,  see  [7]  for  further  explanation. 

Input-output models can also be described in the form (1), with the components
of the regressor $\varphi$ chosen in analogy with their linear counterparts; see [9].

•  FIR: $\varphi(t)$ consisting of $u(t - k_u)$,

•  ARX: $\varphi(t)$ consisting of $y(t - k_y)$, $u(t - k_u)$,

•  OE: $\varphi(t,\theta)$ consisting of $\hat{y}(t - k_y, \theta)$, $u(t - k_u)$,

•  ARMAX: $\varphi(t,\theta)$ consisting of $y(t - k_y)$, $u(t - k_u)$, $\varepsilon(\theta, t - k_y)$,

where $\varepsilon(\theta,t) = y(t) - \hat{y}(t,\theta)$, $k_y = 1, 2, \ldots$, and $k_u = 0, 1, \ldots$.

The models where $\varphi(t,\theta)$ depends on x(t) or $\hat{y}(t,\theta)$, i.e., the state-space
models, OE, and ARMAX models, are called recurrent.
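
The difference between a non-recurrent and a recurrent regressor can be made concrete with a first-order linear example. In the sketch below the ARX predictor uses the measured output y(t-1), while the OE predictor feeds back its own earlier prediction; the parameter names `a` and `b`, the zero initial value and the data are assumptions made for illustration.

```python
def predict_arx(a, b, y, u):
    """One-step ARX prediction: regressor is [y(t-1), u(t-1)] (measured data only)."""
    return [0.0] + [a * y[t - 1] + b * u[t - 1] for t in range(1, len(y))]

def predict_oe(a, b, u):
    """OE prediction: regressor is [yhat(t-1), u(t-1)], so the model is recurrent."""
    yhat = [0.0]
    for t in range(1, len(u)):
        yhat.append(a * yhat[t - 1] + b * u[t - 1])
    return yhat

u = [1.0] * 6
y = [0.0, 1.0, 1.5, 1.75, 1.875, 1.9375]   # generated with a = 0.5, b = 1.0, no noise
arx = predict_arx(0.5, 1.0, y, u)
oe = predict_oe(0.5, 1.0, u)
```

With the true parameters and noise-free data both predictors reproduce y. For other parameter values they differ, because only the OE regressor itself depends on the parameters.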

Linear models are obtained by choosing g to be a linear mapping. Note
that whether the model is recurrent or not depends only on the choice of
regressor, $\varphi$, and not on the mapping g. Hence, the recurrent regressors
mentioned above also give recurrent models when g is chosen to be linear.

In the following the input-output form (1) of the model will be used in
the discussions. It is straightforward to modify the method to state-space
models.

2.2.  The  Mapping 

The mapping $g(\cdot,\cdot)$ in (1) can be any parameterized function. Most black-box
models can be described as a basis function expansion

$$\hat{y}(t) = g(\theta, \varphi(t,\theta)) = \sum_{i=1}^{n} c_i\, g_i(\varphi(t,\theta), \alpha_i) \tag{3}$$

$$\theta = [c_1\ \alpha_1\ c_2\ \alpha_2\ \ldots\ c_n\ \alpha_n].$$

Depending on the specific choice of the functions $g_i(\varphi, \alpha_i)$ we obtain different
model structures like neural nets, radial basis functions etc. The basis function
expansion can also mix different types of functions $g_i$,
and here we will consider the case when the first function in the expansion is
linear. Then (3) can be expressed as

$$\hat{y}(t) = g(\theta, \varphi(t,\theta)) = \theta_l^T \varphi_l(t,\theta) + \sum_{i=1}^{n} c_i\, g_i(\varphi_{nl}(t,\theta), \alpha_i) \tag{4}$$

where $\theta_l^T \varphi_l(t,\theta)$ expresses the linear model which will be used to initialize
the parameters in the second, nonlinear term. Subscript $l$ denotes the linear part
and $nl$ the nonlinear part. This gives us the freedom to use different regressors
in the different parts.


2.3.  Calculate  the  Parameter  Estimate 

Given a model structure g and an estimation data set $\{y(t), u(t)\}_{t=1}^{N}$, the
parameter estimate $\hat{\theta}_N$ is defined as the minimum of a criterion of fit, e.g.,
the sum of squared errors

$$\hat{\theta}_N = \arg\min_{\theta} V_N(\theta), \qquad V_N(\theta) = \sum_{t=1}^{N} \varepsilon^2(\theta,t) \tag{5}$$

where

$$\varepsilon(\theta,t) = y(t) - \hat{y}(t,\theta). \tag{6}$$

The criterion (5) is typically minimized by an iterative search based on the
gradient of the criterion; see, e.g., [1, 10]. From an initial parameter value
$\theta_0$ the criterion (5) is stepwise decreased by changing the parameter vector
in a descent direction of the criterion until a minimum is reached.

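The derivative of the criterion (5) becomes

$$\frac{dV_N(\theta)}{d\theta} = -2 \sum_{t=1}^{N} \varepsilon(\theta,t)\, \frac{dg(\theta,\varphi(t,\theta))}{d\theta} \tag{7}$$

where the derivative of the model output (1) is

$$\frac{dg(\theta,\varphi(t,\theta))}{d\theta} = \frac{\partial g(\theta,\varphi(t,\theta))}{\partial\theta} + \sum_{i=1}^{\dim\varphi} \frac{\partial g(\theta,\varphi(t,\theta))}{\partial\varphi_i(t,\theta)}\, \frac{d\varphi_i(t,\theta)}{d\theta}. \tag{8}$$

The second term is non-zero only when the model is recurrent. Then
$\varphi(t,\theta)$ contains components of $g(\theta,\varphi(i,\theta))$, $i < t$. This makes (8) a nonlinear
filter with input signal

$$\frac{\partial g(\theta,\varphi(t,\theta))}{\partial\theta}.$$

Before the filtering (8) can be performed, $\varphi(t,\theta)$ must be obtained. Hence
the filter

$$\hat{y}(t,\theta) = g(\theta,\varphi(t,\theta)) \tag{9}$$

with input u(t) must also be run.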

The success of the iterative minimization of $V_N(\theta)$ in (5) (or learning)
depends on the initial parameter guess $\theta_0$. A better $\theta_0$ means that the
parameter estimate converges to a better local minimum of $V_N(\theta)$. Moreover, as
described in [4], the nonlinear filters (8) and (9) must be stable for a successful
minimization.
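
For a first-order linear OE model, yhat(t) = a*yhat(t-1) + b*u(t-1), the filters (8) and (9) can be written out explicitly. The sketch below is an illustration under that assumed model and an unnormalized sum-of-squares criterion (the names `a`, `b`, `simulate_with_gradient` and the data are not from the paper): the prediction recursion plays the role of filter (9), and the derivative recursion that of filter (8), where the direct term is the input and the feedback through yhat(t-1) supplies the second term.

```python
def simulate_with_gradient(a, b, u):
    """Run the predictor and its derivative filter for yhat(t) = a*yhat(t-1) + b*u(t-1)."""
    yhat, d_a, d_b = [0.0], [0.0], [0.0]
    for t in range(1, len(u)):
        yhat.append(a * yhat[t - 1] + b * u[t - 1])
        # Direct term plus (dg/dyhat(t-1)) * (dyhat(t-1)/dtheta), with dg/dyhat(t-1) = a.
        d_a.append(yhat[t - 1] + a * d_a[t - 1])
        d_b.append(u[t - 1] + a * d_b[t - 1])
    return yhat, d_a, d_b

def sse_gradient(a, b, u, y):
    """Gradient of the sum of squared errors with respect to (a, b)."""
    yhat, d_a, d_b = simulate_with_gradient(a, b, u)
    err = [yt - yh for yt, yh in zip(y, yhat)]
    return (-2.0 * sum(e * d for e, d in zip(err, d_a)),
            -2.0 * sum(e * d for e, d in zip(err, d_b)))

u = [0.0] + [1.0] * 19
y = simulate_with_gradient(0.8, 0.5, u)[0]   # data from a hypothetical "true" system
grad = sse_gradient(0.6, 0.4, u, y)          # gradient at a different parameter guess
```

Both recursions share the feedback coefficient `a`, so in this example they are stable exactly when |a| < 1.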


3.  INITIALIZING  THE  NONLINEAR  MODEL 

Here follows a presentation of a “natural” first step from a linear model to
a nonlinear model. Consider a basis function expansion like (4) and let the
linear model be the first basis function. If the model is non-recurrent the
parameters $c_i$ can be fitted by least squares and the nonlinear model will
be better than the linear model already at the initialization. For recurrent
models the parameters $c_i$ are initially set to zero. This gives an initialization
of the nonlinear model which is identical to the linear model. It will be shown
that the stability of the filters (8) and (9) then follows from the stability of
the linear model. Of special importance is the fact that the linear model can
be used to obtain a $\varphi(t,\theta)$, which is necessary to obtain a good initialization
of the parameters $\alpha_i$ in (4).

Proceed with the modelling as follows:

•  Start with a linear model as depicted in Figure 2 a) (but with, possibly,
a different regressor) and estimate its parameters.

•  Complement the linear model with nonlinear elements, for example as
shown in Figure 2 b). Initialize the nonlinear part and calculate $\hat{\theta}_N$ by
minimizing $V_N(\theta)$.


Figure  2:  a)  Linear  model,  b)  Linear  model  complemented  with  a  nonlinearity. 

This approach gives us a model structure like the one in (4). In case a
state-space model (2) is used, the added nonlinearity can be chosen to influence
the model output y(t) or one of the states x(t).

The nonlinear element can have a regressor which differs from the linear
regressors $\varphi_l$. In this way the nonlinear model can be chosen to be linear with
respect to some regressors. Such models will be less nonlinear but have other
nice features; see [6, 9].

When  the  structure  is  decided  upon  it  is  time  to  initialize  the  nonlinear 
model.  There  are  three  kinds  of  parameters  in  the  model  (4)  which  have  to 
be  initialized: 

$\theta_l$: Linear part of the model.

$\alpha_i$: Parameters defining the localization of the basis functions. The model is
nonlinear in these parameters.

$c_i$: Amplitudes of the basis functions. The model is linear in these parameters.

The following algorithm specifies how the parameters are initialized.

Algorithm 1 The parameters are initialized as:

$\theta_l$: given by the linear model.

$\alpha_i$: chosen randomly, but with a probability distribution such that the basis
functions are placed on the support of the regressor $\varphi_{nl}(t,\theta)$. The linear
model is used to construct $\varphi_{nl}(t,\theta)$.

$c_i$: for non-recurrent models they are calculated by least squares on the
residuals $\varepsilon = y - \theta_l^T \varphi_l$ with $g_i(\varphi_{nl}(t,\theta), \alpha_i)$ as regressors¹. If the model
is recurrent then the $c_i$ are set to zero.

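The following two theorems reveal the advantage of this initialization.

Theorem 1 The nonlinear non-recurrent model (4) initialized as described
in Algorithm 1 gives a better fit than the linear model.

Proof. Let $\varepsilon(\theta,t) = y(t) - \theta_l^T \varphi_l(t,\theta)$ be the residuals of the linear model.
The sum of squared errors of the initialized nonlinear model then becomes

$$\sum_{t=1}^{N} \Big( \varepsilon(\theta,t) - \sum_{i=1}^{n} c_i\, g_i(\varphi_{nl}(t,\theta), \alpha_i) \Big)^2 .$$

Since the $c_i$ are chosen to minimize this sum, the following inequality holds

$$\sum_{t=1}^{N} \Big( \varepsilon(\theta,t) - \sum_{i=1}^{n} c_i\, g_i(\varphi_{nl}(t,\theta), \alpha_i) \Big)^2 \le \sum_{t=1}^{N} \varepsilon^2(\theta,t)$$

with equality if all $c_i = 0$. □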
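
Theorem 1 can be illustrated numerically for a non-recurrent model with a single basis function. In the sketch below the linear model, the tanh basis function, its location parameters and the data are all assumptions chosen for the example; only the step of fitting the amplitude by least squares on the residuals of the linear model is taken from Algorithm 1.

```python
import math

def fit_amplitude(residuals, basis_values):
    """Least-squares c minimising sum((r - c*g)**2) for one basis function."""
    den = sum(g * g for g in basis_values)
    return sum(r * g for r, g in zip(residuals, basis_values)) / den

def sse(values):
    return sum(v * v for v in values)

# Hypothetical static data: y is linear in x plus a saturating term.
x = [0.1 * k - 2.0 for k in range(41)]
y = [0.5 * xi + 0.8 * math.tanh(3.0 * xi) for xi in x]

theta_l = 0.9                                     # some previously estimated linear model
residuals = [yi - theta_l * xi for xi, yi in zip(x, y)]
basis = [math.tanh(2.0 * xi + 0.1) for xi in x]   # location parameters picked arbitrarily
c = fit_amplitude(residuals, basis)
new_residuals = [r - c * g for r, g in zip(residuals, basis)]
```

Because c = 0 is one of the candidates in the least-squares problem, the new sum of squared errors cannot exceed that of the linear model.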

Theorem 2 The nonlinear recurrent model (4) initialized as described in
Algorithm 1 gives a fit equally good as that of the linear model. Moreover, the
nonlinear filters (8) and (9) are stable if the linear model is stable.

Proof. Notice that $c_i = 0 \Rightarrow g(\theta,\varphi) = \theta_l^T \varphi_l$ for all $\varphi$, i.e., the nonlinear
model is reduced to the linear model. From this the stability of the filter (9)
follows. The stability of filter (8) depends on

$$\frac{\partial g}{\partial \varphi}$$

and using $c_i = 0$ one obtains from (4)

$$\frac{\partial g(\theta,\varphi(t,\theta))}{\partial \varphi(t,\theta)} = \frac{\partial\, \theta_l^T \varphi_l(t,\theta)}{\partial \varphi(t,\theta)}.$$

Since this is identical to the linear model the stability follows. □

¹ It is also possible to fit $\theta_l$ together with $c_i$.

Hence, if the linear model is stable then the initialized nonlinear filters
also become stable. From this initialization the criterion $V_N(\theta)$ is iteratively
minimized as described in Section 2.3.
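
The first claim of Theorem 2 is easy to check in simulation: with all amplitudes set to zero, the recurrent nonlinear model reproduces the linear model exactly, whatever the location parameters are. The sketch assumes a first-order OE structure with tanh basis functions; the function name `simulate` and all parameter values are hypothetical.

```python
import math

def simulate(a, b, c, alpha, u):
    """yhat(t) = a*yhat(t-1) + b*u(t-1) + sum_i c_i*tanh(alpha_i . [yhat(t-1), u(t-1), 1])."""
    yhat = [0.0]
    for t in range(1, len(u)):
        phi = (yhat[t - 1], u[t - 1], 1.0)
        nonlinear = sum(
            c_i * math.tanh(sum(w * p for w, p in zip(alpha_i, phi)))
            for c_i, alpha_i in zip(c, alpha)
        )
        yhat.append(a * yhat[t - 1] + b * u[t - 1] + nonlinear)
    return yhat

u = [0.0] * 10 + [1.0] * 40
alpha = [(1.5, -0.7, 0.2), (-0.4, 2.0, -1.0)]            # arbitrary location parameters
linear = simulate(0.8, 0.5, [], [], u)                    # linear model only
initialised = simulate(0.8, 0.5, [0.0, 0.0], alpha, u)    # amplitudes set to zero
perturbed = simulate(0.8, 0.5, [0.3, -0.2], alpha, u)     # non-zero amplitudes
```

The initialized model inherits the behaviour of the linear model, while non-zero amplitudes change the recursion and stability then has to be monitored during the search.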

The  suggested  method  can  be  modified  and  extended  in  a  number  of 
different  ways,  for  example 

•  The  same  procedure  can  be  used  to  add  new  nonlinearities  to  an  existing 
nonlinear  model. 

•  The  idea  can  be  used  to  initialize  Elman  networks,  [2],  and  other  models 
which  are  close  to  linear  if  the  data  are  properly  scaled.  The  nonlinear 
model  can  then  be  initialized  approximately  equal  to  the  linear  model. 

4.  EXAMPLE 

The  suggested  algorithm  will  now  be  applied  to  a  small  problem  to  illustrate 
the  advantages  compared  to  random  initialization  of  the  model  parameters. 

The data are generated in the following way. The input signal u(t) consists
of 110 samples with a unit step at sample 10. Then the output y(t) is obtained
by filtering the input signal through the filter

$$y(t) = f(y(t-1), y(t-2), u(t-1)) \tag{10}$$

where $f(\cdot)$ is a one-hidden-layer sigmoidal neural net with two sigmoids. The
parameters are chosen so that y(t) becomes oscillating but cannot be modelled
satisfactorily by a linear model. The output signal is shown in Figure 3 b).
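
The data generation can be sketched as below. The paper does not list the network parameters, so the weights here are placeholders chosen only so that the simulation stays bounded; they will not reproduce the signal in Figure 3 b). The function name `sigmoid_net`, the zero initial conditions and the 0-based position of the step are likewise assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_net(y1, y2, u1, hidden, out_w, out_b):
    """One-hidden-layer net; each row of `hidden` holds weights for y1, y2, u1 and a bias."""
    acts = [sigmoid(w[0] * y1 + w[1] * y2 + w[2] * u1 + w[3]) for w in hidden]
    return sum(c * a for c, a in zip(out_w, acts)) + out_b

# Input: 110 samples with a unit step at sample 10 (0-based indexing assumed).
u = [0.0] * 10 + [1.0] * 100

# Placeholder parameters for two sigmoids; NOT the values used in the paper.
hidden = [(2.0, -1.5, 1.0, 0.0), (-1.0, 2.0, 0.5, 0.0)]
out_w, out_b = (1.5, -1.0), 0.0

y = [0.0, 0.0]                                   # assumed initial conditions
for t in range(2, len(u)):
    y.append(sigmoid_net(y[t - 1], y[t - 2], u[t - 1], hidden, out_w, out_b))
```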

Two different approaches to modelling the data are tested. First, a neural net
model of the same type as the one which generated the data is tried

$$\hat{y}(t,\theta) = g(\theta, \hat{y}(t-1), \hat{y}(t-2), u(t-1)) \tag{11}$$

where g is the network and $\theta$ its parameters. Several different initial
parameters, chosen randomly from Gaussian distributions with different variances,
are tried. From the different initial values the criterion of fit (5) is minimized
with the Levenberg-Marquardt algorithm. Equation (8) is used to obtain
the derivative of the model. For one of the initial parameter values
the criterion decrease as a function of the number of iterations is shown in
Figure 3 a). The simulation of the obtained model is depicted in Figure 3
b) together with the true output. Obviously the network has not been able
to model anything other than the mean level of the signal. Although a
large number of different initial parameter values were tried, none of them
performed any better than the depicted result.

The second approach starts with a linear model

$$\hat{y}(t) = f_1 \hat{y}(t-1) + f_2 \hat{y}(t-2) + b_1 u(t-1).$$

To estimate the parameters $f_1$, $f_2$, $b_1$ of the linear model one can use standard
methods which are good at avoiding local minima; see, e.g., [4, 5]. The
simulation with this linear model is also depicted in Figure 3 b). Although it is
not able to describe the output very well, it describes the oscillations fairly
well. The estimated linear model is then complemented with a nonlinear part
which equals the model (11), and the parameters are initialized according to
Algorithm 1. Since there is some randomness in the position of the basis
functions, $\alpha_i$, the procedure is repeated several times, giving a set of different
initial parameter values. The same numerical minimization algorithm as for
the randomly initialized network is then used to calculate the minimum of the
criterion. A typical result is shown in Figure 3: in a) the criterion decrease
is shown and in b) the simulation. The performance is close to perfect and
it would become even better with more iterations of the search algorithm.
None of the models obtained with this initialization was stuck in any local
minimum.


Figure 3: a) Criterion decrease as a function of the number of iterations. Solid line:
random initialization. Dashed line: the suggested method. b) Solid line: true step
response. Dashed: randomly initialized model. Dashed-dotted: linear model used to
initialize the nonlinear model. Dotted: nonlinear model obtained with the suggested
initialization (hardly visible since it is very similar to the solid line).


5.  CONCLUSIONS 
It has been shown that

•  There is a “natural” way to define and initialize nonlinear recurrent
models so that they perform as well as a linear model already at the
initialization. The performance is then improved further by a numerical
minimization of the criterion of fit.

•  Problems with local minima are likely to be smaller with the suggested
method.

•  The suggested scheme guarantees stability of the two nonlinear filters
which are necessary to minimize the criterion of fit by a gradient search.

Abstract - This paper addresses techniques for interpretation and
characterization of trained recurrent nets for time series problems. In
particular, we focus on assessment of effective memory and suggest an
operational definition of memory. Further, we discuss the evaluation of learning
curves. Various numerical experiments on time series prediction
problems are used to illustrate the potential of the suggested methods.

INTRODUCTION 

It is widely recognized that recurrent neural networks (RNNs) are flexible
tools for time series processing, system identification and control problems,
see e.g., [3]. Feed-forward networks can accommodate dynamics by having
a lag space of past input and target values; however, a fully recurrent
network with internal feedbacks allows for even more sophisticated dynamics.
While fully recurrent architectures are the ultimate tool for modeling dynamic
relations, the comprehension of the networks is a challenging subject of
ongoing research. Theoretical investigations of modeling capabilities of RNNs
have been reported, see e.g., [2], [4], [7]. However, to the authors' knowledge,
there is no general theory of the dynamic behavior of a general RNN except
for very special models like the Hopfield network, see e.g., [3]. This indeed
indicates that theoretical analysis of RNNs is extremely complicated. On
the other hand, one might pursue a more computational approach. The
general computational tools from non-linear dynamic systems analysis like phase
portraits, stability analysis, measurement of fractal dimensions or Lyapunov
exponents (see e.g., [1], [3]) may be applied to the analysis of RNNs.

The motivation for this paper is the evaluation and interpretation of trained
recurrent networks, and to suggest and discuss simple operational techniques.
In particular, we focus on the learning curve and present a new method to
determine the effective memory of a recurrent network, which conveys the
relevant time scale of the dynamics.

NETWORK ARCHITECTURE

The objective is to model a non-linear dynamic relation between a discrete-time
input signal x(t) and a discrete-time target signal d(t). The general
architecture of the RNN considered in this presentation is based on [5] and
consists of a single hidden layer of fully connected nonlinear units and one
output unit. In particular, we focus on a network with only one external input,
viz. the most recent value, x(t). That is, the only information available about
previous inputs stems from the memory built up internally in the net. The
advantage of using these networks is that the tedious problem of determining
the optimal lag space of previous inputs is converted into determining the
optimal network architecture in terms of connections and number of hidden
neurons.

The network has a linear output in order to allow for arbitrary dynamic
range, and at time t the prediction of the target d(t) is given by

    y(t) = \sum_{i=1}^{N_h} w_{oi} s_i(t) + w_{ob}                          (1)

where N_h is the number of hidden units, w_{oi} is the weight to the output unit
from hidden unit i, and w_{ob} is the output bias weight. The ith state, s_i(t),
is the output of a hidden unit computed as

    s_i(t) = f\left( \sum_{j=1}^{N_h} w_{ij} s_j(t-1) + w_{ix} x(t) + w_{ib} \right)   (2)

where w_{ij} is the weight to hidden unit i from hidden unit j, w_{ix} is the
weight from the external input x(t), and w_{ib} is the bias weight. f(·) is the
nonlinear activation function tanh(·). Note that the update of the units is
layered [5]: at each time step the hidden units are updated before the output
unit.
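As an illustration of the layered update in Eqs. (1) and (2), the following is a minimal sketch in plain Python. The function name `rnn_forward`, the argument names, and the choice of a zero initial state are assumptions made for this example; they are not taken from the paper.

```python
import math

def rnn_forward(x, W_hh, w_hx, w_hb, w_o, w_ob, s0=None):
    """Run the single-input RNN of Eqs. (1)-(2) over the input sequence x.

    W_hh[i][j] : weight to hidden unit i from hidden unit j
    w_hx[i]    : weight to hidden unit i from the external input x(t)
    w_hb[i]    : bias weight of hidden unit i
    w_o[i]     : weight to the output unit from hidden unit i
    w_ob       : output bias weight
    s0         : initial hidden state (assumed zero if omitted)
    Returns the list of outputs y(t) and the final hidden state.
    """
    n_h = len(w_hx)
    s = list(s0) if s0 is not None else [0.0] * n_h
    y = []
    for x_t in x:
        # Eq. (2): hidden units are updated first, from s(t-1) and x(t).
        s = [math.tanh(sum(W_hh[i][j] * s[j] for j in range(n_h))
                       + w_hx[i] * x_t + w_hb[i])
             for i in range(n_h)]
        # Eq. (1): the linear output is computed from the new states s(t).
        y.append(sum(w_o[i] * s[i] for i in range(n_h)) + w_ob)
    return y, s
```

Because the hidden states are refreshed before the output is read, y(t) already depends on x(t), which is what the layered update implies.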

TRAINING  AND  GENERALIZATION 

Suppose we have a training set of related values of inputs and targets,
\mathcal{T} = \{x(t), d(t)\}_{t=1}^{T}, where T is the number of training
samples. Training is done by adjusting the weights so as to minimize a cost
function. Here we employ the sum of squared errors augmented by a simple weight
decay regularization term,

    C(w) = \sum_{t=1}^{T} [d(t) - y(t | w)]^2 + \alpha \, w^\top w          (3)

where w is the concatenated set of weights and \alpha is a small regularization
parameter. Training aims at minimizing the cost function C(w) and is thoroughly
treated for RNNs in [6].

Suppose that training provides the estimated weight vector \hat{w}. Let \pi be
an initial state vector¹ of the “true” data generating system leading to the
training set \mathcal{T}, and define an associated probability distribution²
p(\pi). Further, define \mathbf{x}(t) = [x(t), x(t-1), ..., x(T+1)]^\top and let
p(d(t), \mathbf{x}(t) | \mathcal{T}, \pi), t > T, be the true joint probability
density function of [d(t), \mathbf{x}(t)] conditioned on the initial state \pi
and the training set \mathcal{T}. The true joint p.d.f. is assumed to be
time-independent (i.e., stationary). The generalization error of the trained net
is defined as the expected squared prediction error on future data immediately
succeeding the training data, i.e., for t > T,

    G(\hat{w}) = \int [d(t) - y(t | \hat{w})]^2 \, p(d(t), \mathbf{x}(t) | \mathcal{T}, \pi) \, p(\pi) \, dd(t) \, d\mathbf{x}(t) \, d\pi   (4)

Thus the generalization error is the ensemble average of the squared error
over 1) possible realizations of [d(t), \mathbf{x}(t)] due to inherent
stochastic processes in the data generating system, and 2) possible initial
states leading to the particular training set.

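We estimate the generalization error by

    \hat{G}(\hat{w}) = \frac{1}{V} \sum_{t=T+1}^{T+V} e^2(t; \hat{w})       (5)

where V is the number of test samples and e(t; \hat{w}) = d(t) - y(t | \hat{w})
is the prediction error.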

LEARNING  CURVE 

The  learning  curve  expresses  the  average  generalization  error  over  all  possible 
training  sets  of  a  particular  size  T  as  a  function  of  T  and  is  an  important 
tool  for  verifying  whether  enough  data  is  available  for  proper  training  of  the 
network.  Moreover,  the  shape  of  the  curve  provides  insight  into  the  nature 
of  the  problem  as  demonstrated  in  the  experimental  section. 

Practical considerations may lead to more restricted definitions. Here we
compute the learning curve as the estimated generalization error when gradually
expanding the training set. That is, there is no average over different
sets of a particular size.

NETWORK  MEMORY 

A characteristic of recurrent neural networks is their ability to build up an
internal memory representing the “history” of previous inputs, on which the
predictions of future values are based. The significance of this internal memory
is especially clear when using RNNs having only one external input. Without
the ability to create internal memory this class of networks would be useless.

Once a recurrent network is trained, the basic idea here is to define an
integer variable M which expresses the effective memory of past values of
the input signal x(t). The memory thus provides a partial insight into the
functionality and dynamics of the network. The experimental section gives
examples of interpreting the dynamics using this simple concept. Recurrent
networks with only one external input cannot weight each previous input
x(t - m) individually but must store their own representation of the past.
Consequently, the RNN has a certain memory profile. We are currently pursuing
the idea of determining the memory profile.

¹ The initial state captures all the information about the time series for t < 0.

² E.g., that all initial states are equally likely.

A feed-forward network does not possess any internal memory, i.e., the
memory is explicitly determined by the memory contained in the preprocessing
of the input signal. The standard approach is to feed the signals from a
tapped delay line [x(t), x(t-1), ..., x(t-M)] into the network, and the memory
thus equals M.

The capacity of the internal memory of a recurrent network increases with
the number of hidden units (i.e., the dimension of the state vector), as the
state vector contains all information about previous inputs. However, to our
knowledge, there are no reports on quantifying the notion of memory in
recurrent networks. In the following we attempt to provide a definition of the
memory of a specific trained recurrent network.

The output from the RNN defined in (1), (2) is based on the current and, in
principle, infinitely many previous inputs³, as shown by

    y(t) = y(t | w, x(t), x(t-1), ..., x(-\infty)).                         (6)

In order to determine the effective average memory of the recurrent network
we suggest evaluating an estimate of the generalization error, i.e., the
prediction error on a test set, using predictions based on only a limited number
of previous inputs. This generalization error is then compared to the error
obtained using all (in principle infinitely many) previous inputs.

In particular, when evaluating the generalization error using only the m
most recent inputs, we compute

    \hat{G}_m(\hat{w}) = \frac{1}{V} \sum_{t=T+1}^{T+V} [d(t) - y(t | \hat{w}, x(t), x(t-1), ..., x(t-m))]^2,   m \ge 0   (7)

where V is the size of the test set. y(t | \hat{w}, x(t), x(t-1), ..., x(t-m)) is
computed for each t in [T+1; T+V] by resetting⁴ the states s_i(t-m-1),
i = 1, 2, ..., N_h, to zero and then iterating the network from time t-m until
time t, using the output y(t) at this time as the prediction of d(t). In the
first iteration, calculating y(t-m | \hat{w}, x(t-m)), the network thus functions
as a feed-forward network, since the previous states of the hidden units, and
thereby all previous external inputs, have no influence on the network
output. Then, the network gradually builds up a representation of the past in
the hidden units over the m+1 iterations from time t-m to time t, at which
point it makes its prediction.

³ This is also true for an RNN in which previous values of the output are fed
back to the input.

⁴ Setting the hidden unit states s_i(t-m-1) to zero is equivalent to erasing the
memory of the network regarding inputs before time t-m.

The resulting errors \hat{G}_0(\hat{w}), \hat{G}_1(\hat{w}), ... are then
compared to \hat{G}_\infty(\hat{w}), denoting the error obtained when using all
available previous inputs, i.e., no resetting of the hidden unit states at any
time. The memory M is now defined as

    M = \inf \left\{ m : \forall m' \ge m, \; \frac{|\hat{G}_{m'}(\hat{w}) - \hat{G}_\infty(\hat{w})|}{\hat{G}_\infty(\hat{w})} < \epsilon \right\}   (8)

where \epsilon is a small number. Thus, the memory M denotes the minimal
number of previous inputs beyond which additional inputs are insignificant.
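The procedure of Eqs. (7) and (8) can be sketched as follows. This is an illustrative sketch only: the helper names (`predict_truncated`, `truncated_error`, `effective_memory`), the tuple layout of `net`, and the zero-based time indexing are assumptions. Lags beyond the largest one evaluated are assumed to stay within the tolerance, since the infimum in Eq. (8) ranges over all m' >= m.

```python
import math

def predict_truncated(x, t, m, net):
    """y(t | w, x(t), ..., x(t-m)): reset the states, then iterate from t-m to t."""
    W, wx, wb, wo, wob = net             # hypothetical weight layout, cf. Eqs. (1)-(2)
    n = len(wx)
    s = [0.0] * n                        # states s_i(t-m-1) reset to zero
    for u in range(max(t - m, 0), t + 1):
        s = [math.tanh(sum(W[i][j] * s[j] for j in range(n))
                       + wx[i] * x[u] + wb[i])
             for i in range(n)]
    return sum(wo[i] * s[i] for i in range(n)) + wob

def truncated_error(x, d, test_idx, m, net):
    """Eq. (7): mean squared test error using only the m most recent lags."""
    return sum((d[t] - predict_truncated(x, t, m, net)) ** 2
               for t in test_idx) / len(test_idx)

def effective_memory(errors, g_inf, eps):
    """Eq. (8): smallest m such that every G_m' with m' >= m is within eps of G_inf.

    errors = [G_0, G_1, ..., G_mmax]. Returns len(errors) if even the largest
    evaluated lag is outside the tolerance band.
    """
    M = len(errors)
    for m in range(len(errors) - 1, -1, -1):
        if abs(errors[m] - g_inf) / g_inf < eps:
            M = m
        else:
            break                        # a larger lag violates the bound, so stop
    return M
```

Scanning from the largest lag downwards enforces the "for all m' >= m" condition: a single lag outside the tolerance stops the search even if smaller lags happen to fall inside it. In this sketch, passing a value of m at least as large as t makes `predict_truncated` use the full history from index 0, which stands in for the reference error.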

The memory measure outlined above determines the number of previous
inputs that the network needs knowledge about in order to obtain good
predictions on all samples in the test set. Thus the measure can be interpreted
as the average memory of the network. A recurrent network, however, is a
dynamic system whose internal characteristics can be highly influenced by
the nature of the input series. In particular, if the input series exhibits
regions of non-stationary behavior, the network dynamics, including the memory,
must clearly be affected. Such changes in dynamics are not captured by the
average memory measure, and we may define a local memory, in accordance with
(8), using a local generalization error estimate⁵

    \hat{G}_m(t; \hat{w}) = \frac{1}{K} \sum_{t'=t-K+1}^{t} [d(t') - y(t' | \hat{w}, x(t'), x(t'-1), ..., x(t'-m))]^2   (9)

where m \ge 0, t > T, and 1 \le K \le V is the size of a smaller test set.
Choosing K too small gives rise to a very noisy measure of the generalization
error but, in principle, a good resolution of changes in the memory requirement.
On the other hand, increasing K improves the accuracy of the generalization
error estimate but reduces the resolution of changes in memory.
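A local memory in the sense of Eqs. (8) and (9) could be computed from per-sample squared errors as sketched below. The function name `local_memory`, the error-table layout, and the backward-looking window of K samples ending at t are assumptions of this example.

```python
def local_memory(sq_err, sq_err_inf, t, K, eps):
    """Local memory at time index t.

    sq_err[m][u]  : squared error at time u when only lags 0..m are used
    sq_err_inf[u] : squared error at time u when the full history is used
    The window covers the K most recent samples u = t-K+1, ..., t and is
    assumed to have a non-zero full-history error.
    """
    window = range(t - K + 1, t + 1)
    g_inf = sum(sq_err_inf[u] for u in window) / K
    M = len(sq_err)                      # returned if no evaluated lag is sufficient
    for m in range(len(sq_err) - 1, -1, -1):
        g_m = sum(sq_err[m][u] for u in window) / K     # local error, Eq. (9)
        if abs(g_m - g_inf) / g_inf < eps:              # criterion of Eq. (8)
            M = m
        else:
            break
    return M
```

Evaluating `local_memory` for successive t gives a memory trace over the test series; a small K makes the trace react quickly but noisily, as discussed above.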


EXPERIMENTS 

The  proposed  methods  for  estimating  the  learning  curves  and  memory  are 
evaluated  on  two  chaotic  time  series  prediction  problems,  viz.  the  laser  series 
from  the  Santa  Fe  time  series  competition  [9]  and  the  artificially  generated 
Mackey-Glass  series  [8]. 

⁵ Notice that, by defining this measure for all t > T, some of the first values
are based partly on training examples.

Figure 1: Left panel: The Santa Fe laser series. Right panel: Learning curve for
the laser data (horizontal axis: number of training examples). Dots denote the
error for individual nets; the connected circles indicate the average.

The laser series is illustrated in the left panel of Figure 1. Let z(t) denote
the series; identification is then done by training the network to perform a
one-step-ahead prediction, i.e., we use x(t) = z(t) and d(t) = z(t+1). All 10093
available samples are used and scaled to zero mean and unit variance. From
these data we construct a learning curve. The training series are obtained by
extending backwards in time from point 7000, and the last 3093 points in the
series are used as the test series. For instance, a training set of size 1000
involves training using z(6000) through z(7000). The employed nets have one
external input and ten hidden units. For each training set size, we train ten
networks using different random initial weights and compute the resulting
normalized mean squared error (NMSE) on the test set. NMSE is defined by

    NMSE = \frac{\frac{1}{|S|} \sum_{t \in S} [d(t) - y(t)]^2}{\mathrm{var}(d(t))}   (10)

where t runs over the set S in question (i.e., either training or test set), |S|
is the size of the set, and var(·) denotes the empirical variance.
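Eq. (10) translates directly into code. In this minimal sketch the function name `nmse` is hypothetical, and the empirical variance is taken with a 1/|S| normalization, which is an assumption since the text does not specify it.

```python
def nmse(d, y):
    """Normalized mean squared error of predictions y against targets d, Eq. (10)."""
    n = len(d)
    mse = sum((di - yi) ** 2 for di, yi in zip(d, y)) / n
    mean_d = sum(d) / n
    var_d = sum((di - mean_d) ** 2 for di in d) / n   # empirical variance (1/n form assumed)
    return mse / var_d
```

With this normalization a predictor that always outputs the mean of the targets scores 1, so values well below 1 indicate that the network has captured structure in the series.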

The  learning  curve  is  shown  in  the  right  panel  of  Figure  1.  Initially  the  test 
error  drops  as  the  size  of  the  training  set  is  increased,  but  from  training  set  size 
2500  to  5500  the  average  test  error  is  fairly  constant.  This  can  be  explained  by 
visual  inspection  of  the  laser  series  as  the  “shape”  of  many  collapses  between 
the  corresponding  points  1500-4500  seems  atypical  for  the  test  series.  We  see 
a  significant  drop  in  test  error  when  increasing  the  training  set  size  from  5500 
to  6000  points  which  might  be  explained  by  the  fact  that  the  training  set  now 
incorporates  an  additional  collapse  very  similar  in  shape  to  the  ones  in  the 
test  series.  These  observations  suggest  that  for  the  laser  series,  the  concept  of 
an  example  should  be  conceived  on  several  time  scales:  there  are  the  pointwise 
examples  corresponding  to  each  single  input  presented  to  the  network;  but 
more important, there obviously exist “super examples” consisting of a whole
section  of  the  time  series.  If  additional  super  examples  or  sections  are  not 
similar  to  the  sections  encountered  in  the  test  series,  generalization  will  not 
improve  as  seen  in  the  right  panel  of  Figure  1. 

We now examine the memory of selected networks. The left panel of
Figure 2 depicts the normalized version of Eq. (7) for increasing values of
lag space m when evaluating one of the networks with low test error trained
on 7000 examples. The horizontal dotted line indicates the normalized level
\hat{G}_\infty(\hat{w}) using all available previous inputs. It seems that the
network has a memory somewhere between 120 and 200. The precision \epsilon in
(8) denotes a level below which we consider the two errors as equivalent. The
value of the memory thus naturally depends on the choice of \epsilon, as shown
in Table 1. In the right panel of Figure 2 the normalized test error for
increasing lag space m is shown for another of the nets trained on 7000 points.
We note that for this network the memory M is less sensitive to \epsilon, as it
is between 23 and 25 for \epsilon < 0.18. We also note that the memory is much
shorter than for the previous network even though the test errors are almost
identical. Note that, since the network complexity⁶ is restricted, a network
with short memory is able to allow for more individual contribution of each of
the previous inputs x(t - n) than a network with long memory. The memory profile
of a short term memory net is thus more fine grained than that of a long term
memory net (with the same complexity). One might claim that a compact memory
model is better tuned to the problem.

    \epsilon    0.05    0.025    0.01
    M           150     183      198

Table 1: Dependence of the memory M on \epsilon for the curve in the left panel
of Figure 2.

Figure 2: Left panel: Measuring average memory for one of the networks with low
generalization error trained on 7000 examples from the laser series. Right panel:
Measuring average memory for another of the networks trained on 7000 examples.

In the left panel of Figure 3 we illustrate the average memory of the
network with the lowest test error when training on only 500 examples. We notice
that by limiting the memory the error can actually become lower than
\hat{G}_\infty. This effect often occurs for overtrained networks, which is also
the case here. The memory of the network is highly specialized to the training
set; limiting the memory acts as regularization and actually improves the
performance on the test set.

We now illustrate that the memory of a recurrent network indeed is a
dynamic quantity by examining the local memory defined by Eqs. (8) and (9)
for the network whose average memory is shown in the left panel of Figure 2.
The right panel of Figure 3 and the left panel of Figure 4 illustrate the
dynamic memory measure using precision \epsilon = 0.01 and averaging over K = 5
and K = 50 examples, respectively. The memory is seen to be very dependent
upon where in the laser series it is measured; the closer to a collapse, the
larger. The memory required around the last collapse is significantly larger
than around the previous collapses. This may be explained by the observation
that the characteristics of the laser series just before the last collapse are
highly atypical compared with the rest of the test series. The memory in the
right panel of Figure 3, averaging over only K = 5 previous errors, is seen to
be a very noisy quantity. As K is increased the error measure becomes smoother.
Recall from Table 1 that the average memory for \epsilon = 0.01 is M = 198;
however, the illustrations of the local memory show that by omitting the last
collapse the average memory would be approximately 150.

⁶ E.g., measured by the number of hidden neurons.

Figure 3: Left panel: Measuring average memory for the best network trained on
500 examples from the laser series (horizontal axis: number of previous samples,
m). Right panel: Measuring local memory with threshold \epsilon = 0.01 using a
five-point average, K = 5 (horizontal axis: time).

The Mackey-Glass series is a standard problem of nonlinear dynamics and
results from the integration of a differential equation, see e.g., [8]. Standard
practice is to implement a six-step-ahead predictor, i.e., modeling z(t) from a
lag space vector \mathbf{x}(t) = [z(t-6), z(t-12), ..., z(t-6n_I)] using
feed-forward networks. Here we implement the six-step-ahead predictor with
target value d(t) = z(t) using a recurrent network with only one external input,
x(t) = z(t-6), and ten hidden units. The right panel of Figure 4 shows a learning
curve for the Mackey-Glass series when training on up to 1500 samples and
testing on the following 7000 samples. For each training set size ten networks
were trained. The learning curve indicates that more than 1000 examples are
needed in order to obtain consistently good results on the test set. We then
determined the average memory defined by Eqs. (7) and (8) for the properly
trained networks with the lowest errors on the test set. Using the threshold
\epsilon = 0.01 we found that the networks implemented a memory in the range of
118-263, as seen from Figure 5.

Figure 4: Left panel: Measuring local memory with threshold \epsilon = 0.01 using
a fifty-point average, K = 50 (horizontal axis: time). Right panel: Learning
curve for the Mackey-Glass series (horizontal axis: number of training examples).

Figure 5: Measuring average memory for networks trained on 1500 examples from
the Mackey-Glass series. Left panel: Network having short memory. Right panel:
Network having long memory.

The memories implemented by the recurrent networks are surprisingly
long. In order to obtain comparable performance using feed-forward networks,
six external inputs are needed, thus spanning a total of only 31 previous
samples. This is the minimal memory necessary for good performance provided
weighting of individual lags is possible; however, an RNN's memory profile is
more coarse grained, reducing the possibility of individual weighting.
Furthermore, maintaining information about all previous input values seems to
bias recurrent networks towards the implementation of a long effective memory.

The long memory implemented by the recurrent networks seems to be
of prime importance for the robustness of these models. Preliminary experiments
indicate that recurrent networks are far more resilient to noise perturbations
of the input data than comparable feed-forward networks. Examination
of the robustness of recurrent networks is a topic of ongoing research.

CONCLUSION 

In this paper we have focused on determining the effective memory of
recurrent neural networks when used for time series processing, equivalent to
the span of the externally provided lag space for feed-forward networks. In
particular, we have suggested an operational definition which measures the
memory of a fully trained RNN on a test set. The viability of the method is
illustrated on two chaotic time series problems.


Abstract. In this contribution, we suggest a convenient way to
use generalisation error to extract the relevant delays from a
time-varying process, i.e. the delays that lead to the best prediction
performance. We design a generalisation-based algorithm that takes
its inspiration from traditional variable selection, and more
precisely stepwise forward selection. The method is compared to other
forward selection schemes, as well as to non-parametric tests
aimed at estimating the embedding dimension of time series. The
final application extends these results to the efficient estimation of
FIR filters on some real data.


OVERVIEW 

In system identification as well as in time series modelling, the choice of the
inputs to our model plays a crucial role. In order to obtain good performance,
one should model future behaviour from a set of relevant past measurements.
An insufficient number of inputs will prevent the model from capturing the
underlying mapping. On the other hand, including irrelevant inputs will lead
to poor prediction performance, as suggested by the “curse of dimensionality”.

In this contribution, we consider a method aimed at finding a set of
relevant delays. For that purpose, we use a suboptimal iterative method that
minimises the estimated generalisation error, and bears resemblance to the
usual statistical variable selection methods [6]. However, this Extraction of
Relevant Delays (ERD) method is original in that 1) it assesses the
relevance of possible inputs on the basis of generalisation, and 2) it is adapted
to time-dependent problems.

The organisation of this paper is as follows: first we give a short
presentation of the topic of statistical variable selection, and describe our ERD
method. We then briefly introduce a class of methods estimating the
embedding dimension of time series. The second part of the paper contains
a number of experiments conducted on the well-known Henon map, on a
real time series, and finally on a FIR filtering problem. We conclude with a
discussion of the results.


INPUT  SELECTION 

Let us consider a standard time series modelling problem. A sequence x
of measurements is collected, and we try to predict x_t from a set of past
values x_{t-d}. Note that in that setting, the length of the basic time delay
(i.e. the difference between t and t+1) is imposed on us. Extracting the relevant
delays consists in finding a set of m delayed values (x_{t-d_1}, ..., x_{t-d_m})
which, given as input to a model, yields the best prediction.

This is a special case of variable selection, which in turn can be seen as
part of the more general problem of analysing the structure in the data [6].
An important assumption in conventional variable selection is that all
necessary variables are available, i.e. a sufficient subset of inputs actually
exists. Provided that data are sampled correctly, this assumption is usually
satisfied in the case of time series¹. We will use the terms ‘variable’, ‘input’
and ‘delay’ interchangeably when addressing our time series modelling problem.

An  exhaustive  search  through  all  possible  subsets  of  inputs  is  usually 
impossible  for  combinatorial  reasons.  A  number  of  suboptimal  techniques 
have  thus  been  designed,  among  them  stepwise  methods: 

•  Forward selection methods consist in starting from an empty set of
inputs, and adding variables one after the other according to a given
selection criterion, until a chosen stopping condition is fulfilled.

•  On the contrary, backward elimination methods start with the full set of
inputs, and proceed by deleting one variable at a time according to the
selection criterion, until the stopping condition is reached. In the field of
neural computation, variable selection techniques based on pruning [2]
are a typical example of backward elimination.

Stepwise regression usually refers to a combination of both (in the linear
case). For both methods, the crucial parts are the design of the selection
criterion and of the stopping condition. Conventional methods in linear
regression rely on e.g. correlation coefficients, information content or
F-testing.


¹ It breaks down in the case where a long-term delay is needed that ranges
further than the time period spanned by the data. However, the relevance of such
long-term prediction is questionable, and there would be no data to identify the
associated parameter(s) anyway.


EXTRACTION OF RELEVANT DELAYS

We present here a method of Extraction of Relevant Delays (ERD) that
relies upon generalisation error. It draws its inspiration from forward
selection methods, combined with generalisation estimation. Consider a model
f providing a mapping from an input vector containing m delays,
x^{(m)} = (x_{t-d_i})_{i=1...m}, to the output x_t, and assume Gaussian
perturbation on the output. We define the generalisation error (or expected
risk) for this model as:

    G(f) = \int \left( f(x^{(m)}) - x_t \right)^2 p(x_t, x^{(m)}) \, dx_t \, dx^{(m)}   (1)

Obviously, equation (1) cannot be used directly as the joint input-output
probability  is  unknown.  We  will  thus  resort  to  estimating  this  error,  or  rather 
its  average  over  all  possible  training  sets  of  a  given  size  N.  There  are  mainly 
two  classes  of  such  estimators:  methods  such  as  cross-validation  [17]  resample 
the  available  data,  while  algebraic  estimators  [1]  rely  on  statistical  arguments. 

Consider for example the second option. Many estimators have been
proposed in the literature, e.g. Final Prediction Error (FPE) [1], Generalised
Prediction Error (GPE) [11], Final Prediction Error for Regularised problems
(FPER) [7] or Network Information Criterion (NIC) [12]. We will here settle
for an expression similar to GPE, i.e. an FPE where the number of parameters
is replaced by the number of efficient parameters P:

    \hat{G} = \frac{N + P}{N - P} \langle S \rangle                         (2)

where \langle S \rangle is the average training error (over all training sets of
size N). As such an average is not available, we plug in the measured training
error (or empirical risk) S(f) instead. For quadratic risk,
S(f) = \frac{1}{N} \sum_t ( f(x_t^{(m)}) - x_t )^2.
The calculation of P depends on the regularisation method used during
training (see e.g. [7, 3]).
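Eq. (2), with the measured training error plugged in for its average, amounts to a one-line correction of the empirical risk. The sketch below assumes quadratic risk; the function name `fpe_estimate` and its arguments are hypothetical, and the number of efficient parameters P is taken as given rather than computed.

```python
def fpe_estimate(residuals, n_eff_params):
    """FPE-type generalisation estimate of Eq. (2) from training residuals.

    residuals    : list of training residuals f(x) - x_t, one per example
    n_eff_params : number of efficient parameters P (supplied by the caller)
    """
    n = len(residuals)
    if n_eff_params >= n:
        raise ValueError("the estimate requires N > P")
    train_err = sum(r * r for r in residuals) / n     # empirical quadratic risk S(f)
    return (n + n_eff_params) / (n - n_eff_params) * train_err
```

The correction factor grows with P, so adding an input that barely reduces the training error can still increase the estimated generalisation error.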

The proposed ERD method is a forward method that takes all delays in their
natural order (which bypasses the selection criterion), and adds a candidate
input if and only if it corresponds to a significant decrease in generalisation
error. The algorithm can be described as follows:

1.  Initialise: d = 0; G_min = \sigma_x^2; no input selected.

2.  Model: d = d + 1; add delay t - d to selected inputs; estimate
generalisation error \hat{G} for resulting model.

3.  Test: if \hat{G} is significantly smaller than G_min, keep delay t - d;
G_min = \hat{G}. Discard otherwise.

4.  Iterate: Go to step 2 until stop condition is reached.
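The four steps above can be sketched as a short loop. This is an illustrative skeleton only: the function name `extract_relevant_delays`, the two callables, and the use of a maximum delay as the stop condition are assumptions, since the text leaves the model, the estimator and the stop condition open.

```python
def extract_relevant_delays(max_delay, estimate_error, is_significant, g_init):
    """Forward ERD loop over delays 1..max_delay taken in their natural order.

    estimate_error(delays) : trains a model on the given delays and returns
                             its estimated generalisation error (user supplied)
    is_significant(g, g_min): True if g is significantly smaller than g_min
    g_init                 : initial value of G_min (step 1)
    """
    selected, g_min = [], g_init                 # step 1: no input selected
    for d in range(1, max_delay + 1):            # step 4: stop at max_delay (assumed)
        g = estimate_error(selected + [d])       # step 2: model with candidate delay
        if is_significant(g, g_min):             # step 3: keep or discard
            selected.append(d)
            g_min = g
    return selected, g_min
```

Because a discarded delay is never revisited, the loop trains exactly one model per candidate delay, which is what makes the method cheap compared with an exhaustive subset search.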


Significant decrease in error. When a candidate delay yields a decrease
in (estimated) generalisation error, step 3 requires that we assess the
significance of this decrease. We take advantage of the fact that the
generalisation estimators mentioned above are based on averaging a statistic,
and test whether the statistics associated with two different generalisation
estimates have significantly distinct means by performing a paired
t-test [15, 8].

In our case, the estimated (average) generalisation error given by FPE
can be expressed as the following average:

    \hat{G} = \frac{1}{N} \sum_{k=1}^{N} \frac{N + P}{N - P} \, e_k(w)

where e_k(w) is the local risk (e.g. squared residual) for training example k
and a model parameterised by w. Let us consider two models trained on the
same set of examples, with P_1 and P_2 the numbers of efficient parameters for
the first and second model (respectively). The distribution of the corrected
residuals \frac{N+P_1}{N-P_1} e_k(w_1) (resp. \frac{N+P_2}{N-P_2} e_k(w_2)) has
mean \langle \hat{G}_1 \rangle (resp. \langle \hat{G}_2 \rangle). We thus test
whether \langle \hat{G}_2 \rangle is significantly smaller than
\langle \hat{G}_1 \rangle by using a paired t-test on the corrected residuals.
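The paired test on corrected residuals can be sketched as follows. The function name `paired_t_statistic` is hypothetical, and the sketch only computes the test statistic; the critical value of the Student t distribution with N-1 degrees of freedom must be supplied by the user at the chosen significance level.

```python
import math

def paired_t_statistic(e1, p1, e2, p2):
    """Paired t statistic for corrected local risks of two models.

    e1, e2 : local risks (e.g. squared residuals) of models 1 and 2 on the
             same N training examples, in the same order
    p1, p2 : numbers of efficient parameters of the two models
    A large positive value suggests that model 2 has the smaller estimated
    generalisation error. The differences must not all be identical.
    """
    n = len(e1)
    c1 = (n + p1) / (n - p1)                      # FPE correction factors
    c2 = (n + p2) / (n - p2)
    diff = [c1 * a - c2 * b for a, b in zip(e1, e2)]
    mean = sum(diff) / n
    var = sum((x - mean) ** 2 for x in diff) / (n - 1)   # unbiased sample variance
    return mean / math.sqrt(var / n)
```

A possible use in step 3 of the algorithm is to keep the candidate delay only when the statistic exceeds the one-sided critical value.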

The case of cross-validation is somewhat more straightforward. The
leave-one-out (LOO) cross-validation score is calculated by averaging the
prediction error on one example for a model trained on the remaining examples:

    \hat{G}_{LOO} = \frac{1}{N} \sum_{t=1}^{N} \left( f_{-t}(x_t^{(m)}) - x_t \right)^2

where f_{-t} is the model trained without example (x_t^{(m)}, x_t). For two
different models, the residuals are paired according to the example left out,
so that a (paired) t-test can be used to determine whether these residuals come
from distributions with different means, i.e. correspond to different average
generalisation errors. Extension to m-fold cross-validation is straightforward.
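The leave-one-out residuals can be collected with a generic loop, sketched below. The names `loo_squared_residuals` and `fit_mean` are hypothetical; `fit_mean` is a deliberately trivial stand-in model used only to make the example self-contained.

```python
def loo_squared_residuals(inputs, targets, fit):
    """Squared leave-one-out residuals, one per example.

    fit(xs, ys) must return a callable model; it is retrained N times,
    each time without the example whose residual is being computed.
    """
    residuals = []
    for t in range(len(targets)):
        xs = inputs[:t] + inputs[t + 1:]          # training set without example t
        ys = targets[:t] + targets[t + 1:]
        model = fit(xs, ys)
        residuals.append((model(inputs[t]) - targets[t]) ** 2)
    return residuals

def fit_mean(xs, ys):
    """Stand-in model that ignores its input and predicts the training mean."""
    mu = sum(ys) / len(ys)
    return lambda x: mu
```

The LOO score is the mean of the returned list; keeping the individual residuals, ordered by the example left out, is what allows the pairing between two models.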


EMBEDDING  DIMENSION 

In  the  study  of  non-linear  dynamical  systems,  and  time  series  in  particular, 
an  important  problem  lies  in  finding  the  embedding  dimension  [16],  which 
is  essentially  equivalent  to  finding  the  set  of  primary  delays  in  time  series. 
In the realm of neural computation, the recently proposed δ-test method [14]
addresses this issue. In a different field, a method for identifying the order
of non-linear input-output systems was proposed [5] that relies on the use
of “Lipschitz quotients”, i.e. the ratio between output and input distances. A
similar  method  applied  to  time  series  (called  ‘geometrical  technique’)  was 
presented  last  year  at  this  workshop  [10]. 

Though different in practice, these methods rely on a common assumption
on the continuity of the underlying mapping, and use a geometrical approach
based on the data alone. The continuity argument means that if there is a
mapping between $x^{(t)}$ and $x_t$, then close inputs $x^{(u)}$ and $x^{(v)}$ should correspond
to close outputs $x_u$ and $x_v$. Accordingly, as long as the input space
is insufficient (i.e. missing delays), close inputs can correspond to arbitrarily
distant outputs. Quantifying this is done either by measuring empirical probabilities
that two outputs are close given that the corresponding inputs are
close (δ-test), or by calculating the ratio between output and input distances
(Lipschitz quotients).
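
The geometrical idea can be sketched in a few lines. The code below is an illustration only, not the procedure of [5] or [10]: it builds delay vectors for a candidate set of delays and returns the largest ratio between output and input distances over all pairs. The function names, the use of the maximum as summary statistic and the toy series are assumptions made for the example.

```python
import math


def delay_vectors(series, delays):
    """Build (input vector, output) pairs using the given positive delays."""
    start = max(delays)
    inputs = [[series[t - d] for d in delays] for t in range(start, len(series))]
    outputs = [series[t] for t in range(start, len(series))]
    return inputs, outputs


def max_lipschitz_quotient(inputs, outputs):
    """Largest |y_u - y_v| / ||x_u - x_v|| over all pairs of examples.

    Identical inputs with different outputs give an infinite quotient,
    which signals that the input space is missing delays.
    """
    best = 0.0
    for u in range(len(inputs)):
        for v in range(u + 1, len(inputs)):
            dist = math.dist(inputs[u], inputs[v])
            gap = abs(outputs[u] - outputs[v])
            if dist == 0.0:
                if gap > 0.0:
                    return math.inf
                continue
            best = max(best, gap / dist)
    return best


# Toy periodic series: one delay is ambiguous, two delays are sufficient.
toy = [0, 1, 1, 0, 1, 1, 0, 1, 1]
q_one = max_lipschitz_quotient(*delay_vectors(toy, [1]))
q_two = max_lipschitz_quotient(*delay_vectors(toy, [1, 2]))
```

The double loop over all pairs makes the quadratic cost of such techniques explicit.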

It  should  be  noted  that  these  methods  are  non-parametric.  They  rely  on 
the  data  alone,  and  need  not  specify  a  given  model  (contrary  to  the  ERD 
method).  This  can  turn  out  to  be  a  disadvantage  since  for  a  given  data  set, 
they  always  select  the  same  set  of  relevant  delays,  regardless  of  the  ability 
of  our  model  to  actually  implement  the  underlying  mapping.  It  could  very 
well  be  that  for  the  model  at  hand,  the  estimation  would  benefit  from  the 
inclusion  of  a  secondary  delay,  as  shown  in  the  next  section  and  discussed 
further  down.  Furthermore,  these  geometrical  techniques  require  extensive 
calculations,  as  they  consider  all  pairs  of  data.  They  are  thus  computationally 
expensive. 


TIME  SERIES  EXPERIMENTS 

This  section  is  devoted  to  two  simple  experiments.  First  we  use  an  artificial 
problem  (the  Henon  map),  for  which  a  large  validation  set  confirms  the  re¬ 
sults  obtained  by  our  ERD  method.  In  the  second  experiment,  we  discover 
interesting  long  term  dependencies  on  a  real  time  series. 

The Henon map is implemented by the following mapping: $x_t = 1 - 1.4\,x_{t-1}^2 + 0.3\,x_{t-2}$.
We generate a training set containing 500 data points, and a test
set of 10000 elements for assessing generalisation abilities. We experiment
on non-noisy as well as noisy data, with $\sigma^2 = 0.1$. Two different models
are used: a linear model (obviously ill-suited to this purpose) and a non-parametric
kernel smoother. The generalisation estimators are the FPE and
LOO respectively.
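
For readers who wish to reproduce a comparable data set, here is a minimal sketch of a Henon series generator. Several choices are assumptions and not taken from the text: the quadratic coefficient a = 1.4 is the standard value for this map, the noise is added to the observations only (it is not fed back into the dynamics), the noise level is passed as a standard deviation, and the starting point, burn-in length and seed are arbitrary.

```python
import random


def henon_series(n, a=1.4, b=0.3, noise_std=0.0, seed=0, burn_in=100):
    """Generate n points of x_t = 1 - a * x_{t-1}^2 + b * x_{t-2}.

    Gaussian observation noise with standard deviation noise_std is added
    to the returned values only; the underlying dynamics stay noise-free.
    """
    rng = random.Random(seed)
    x_prev2, x_prev1 = 0.0, 0.0  # assumed starting point inside the basin
    out = []
    for step in range(n + burn_in):
        x = 1.0 - a * x_prev1 ** 2 + b * x_prev2
        x_prev2, x_prev1 = x_prev1, x
        if step >= burn_in:  # discard the initial transient
            out.append(x + rng.gauss(0.0, noise_std))
    return out


train_clean = henon_series(500)
train_noisy = henon_series(500, noise_std=0.1)
```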

In order to check whether the delays are wisely chosen, experiments are
performed comparing the ERD method and other selection methods (Table 1):

1. a forward selection method using a large validation set (distinct from
the test set) of 10000 data points;

2. the F-inclusion, a selection scheme based on the F-statistics [6];

3. the δ-test [14].

As shown in Table 1, all forward selection methods outperform the δ-test
in the linear case: a linear combination of the first two delays is obviously
insufficient to model the mapping. The performance is rather homogeneous
among forward selection methods, though the ERD method tends to favour
parsimonious models, while keeping good generalisation abilities.

On the non-noisy data, the kernel smoother captures the underlying mapping
in all cases. When the training data is noisy, the F-inclusion scheme
displays a severe case of the curse of dimensionality. The other methods select
one additional delay, $t-3$. As we will discuss later, this theoretically unnecessary
input leads to an improved prediction accuracy on both the training
and generalisation set.

Henon map                              Non-noisy data        Noisy data
                                       Linear    Kernel      Linear    Kernel
Large validation set  Delays           1-7       1-2         1-7       1-3
                      MSE              0.376     0.000       0.503     0.214
                      Generalisation   -         0.000       0.389     0.067
ERD                   Delays           1,3-6     1-2         1,3,4     1-3
                      MSE              0.376     0.000       0.523     0.214
                      Generalisation   0.378     0.000       0.409     0.067
F-inclusion           Delays           1-6       1-2         1-6,10    1-8,11-13,16,17,19,20
                      MSE              0.376     0.000       0.499     0.032
                      Generalisation   0.379     0.000       0.389     0.294
δ-test                Delays           1-2       1-2         1-2       1-2
                      MSE              0.457     0.000       0.567     0.266
                      Generalisation   0.455     0.000       0.459     0.097

Table 1: Results on the noisy and non-noisy Henon map data, for two models: a
linear model and a non-parametric kernel smoother. MSE is the Mean Squared
(training) Error, generalisation is estimated on 10000 non-noisy data.

Fraser river data. As an example of real time series processing, we will use
a publicly available dataset¹ containing the mean monthly flow of the Fraser
River in Hope, British Columbia, from March 1913 to December 1990 [9].
It is a roughly periodic data set containing 946 measurements with maxima
every 11 to 13 months. We split the data set so that we have half the data
for training and half for testing the prediction abilities of the model. In the
following experiments, we use the log values of the data, and estimate the
parameters by minimising the Mean Squared Error on the transformed data.

The use of a large validation set is not possible here, as is (unfortunately)
the case with most real-life problems. We will compare the results of the ERD
scheme to the results provided by the non-parametric δ-test. According to
this test, the embedding space of the time series involves 6 delays.

Note that the ERD method once again outperforms the method based on
estimating the embedding dimension. The linear model probes further into
the past, and spots relevant delays up to $t-48$, i.e. four times the time span
covered by the δ-test. The kernel smoother seems to be experiencing some
problems coping with the dimensionality of the data; they could probably
be minimised using a variable metric. The neural network model selects the
same number of delays as the δ-test. However, the selection is targeted
towards minimisation of generalisation error, which is reflected in the sizeably
smaller test error. Notably, the non-linear neural network model, though
using a combination of regularised cost optimisation and OBS pruning [4],
does not manage to extract longer-term delays, and is outperformed by the
simpler linear model.

Fraser river                  Linear                       Kernel          Neural net
ERD          Delays           1,2,4-7,10,11,23,26,35,48    1,2,4,7,11,13   1,2,4,7,11,23
             MSE              0.0529                       0.0389          0.0425
             Generalisation   0.0439                       0.0547          0.488
δ-test [14]  Delays           1,2,4,7,8,11                 1,2,4,7,8,11    1,2,4,7,8,11
             MSE              0.0680                       0.0441          0.452
             Generalisation   0.0609                       0.0530          0.627

Table 2: Results for the Fraser river data set, and three different models. MSE is
the Mean Squared Error.

¹ Available on StatLib at http://lib.stat.cmu.edu/datasets/


OPTIMISING  FIR  FILTERS 

We will now extend the method and apply it to fMRI signal modelling. The
fMRI signal measures the hemodynamic response to focal neuronal activation.
The data is collected as a 504-step time series containing measurements
corresponding to the hemodynamic response to a series of periodic baseline
and activation periods (7 periods in all). The data is corrupted by a very
high level of noise.

Modelling this response as a function of the activation signal is the subject
of active research [13]. We extend the above method to optimise the
choice of relevant delays when trying to model the response with an FIR filter
applied to the excitation signal. Current attempts at doing so use a fixed lag
of 7 delays.

The  ERD  method  is  simply  extended  by  testing  sequentially  chosen  delays 
in  the  excitation  signal  rather  than  the  time  series  itself.  We  applied  the 
method  on  5  voxels  that  were  identified  as  being  particularly  responsive  to 
the  excitation.  Out  of  the  504  measurements,  we  set  the  last  two  periods, 
or  144  data,  aside  for  testing  the  generalisation  abilities.  The  first  5  periods, 
containing  360  points,  are  used  for  identifying  the  relevant  delay  and  the  filter 
coefficients.  The  FPE  is  used  as  a  generalisation  estimate. 
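
To make the two ingredients of this procedure concrete, the sketch below shows an FIR prediction restricted to a chosen set of delays of the excitation signal, together with Akaike's Final Prediction Error, MSE x (N + P) / (N - P). The function names, the dictionary representation of the filter and the example values are hypothetical; estimating the coefficients (e.g. by least squares) is left out.

```python
def fir_predict(excitation, coeffs):
    """Predict the response from selected delays of the excitation signal.

    coeffs maps each selected delay d (a positive integer) to its filter
    weight. Predictions are returned for t = max delay, ..., len - 1.
    """
    start = max(coeffs)
    return [
        sum(w * excitation[t - d] for d, w in coeffs.items())
        for t in range(start, len(excitation))
    ]


def final_prediction_error(residuals, n_params):
    """Akaike's FPE: training MSE scaled by (N + P) / (N - P)."""
    n = len(residuals)
    if n <= n_params:
        raise ValueError("need more residuals than parameters")
    mse = sum(r * r for r in residuals) / n
    return mse * (n + n_params) / (n - n_params)


# Hypothetical two-delay filter applied to a toy on/off excitation.
excitation = [0, 0, 1, 1, 0, 0, 1, 1]
prediction = fir_predict(excitation, {1: 0.5, 2: 0.25})
```

In a forward selection loop, a candidate delay would be kept only if adding it lowers the FPE of the refitted filter.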

On the 5 fMRI time series studied, we extracted from 1 to 4 delays,
ranging from $t-1$ to $t-22$. On voxel number 3 for example, our experiments
surprisingly select only $t-1$, but we can see in figure 1 (left panel) that
this actually leads to a slight decrease in generalisation error compared to the
fixed 7-delay filter. Overall, the results displayed in the left panel of figure 1
suggest that on the extremely noisy data, the method leads to performance
that is comparable to the fixed FIR filter, while using fewer parameters.

On the first voxel, the extraction of relevant delays leads to a noticeable
decrease in generalisation error. The right panel of figure 1 plots the response
of voxel 1 together with both FIR estimates.

[Figure 1, two panels. Left panel: “Comparison between fixed lag and extracted lag” (horizontal axis: voxel no.). Right panel: “FIR filters vs. actual response (voxel 1)” (horizontal axis: time step).]

Figure 1: Left: performance comparison. For each voxel the 7-delay FIR filter is on
the left, while the FIR filter with extracted delays is on the right. Right: behaviour
of the 7-delay filter and the extracted filter on the fMRI time series measured in
the first voxel.

DISCUSSION 

The FIR example above illustrates the fact that using a parsimonious model,
with delays appropriately chosen, is a good way of obtaining good modelling
abilities.  This  can  be  of  great  help  when  facing  a  problem  on  which  we 
have  no — or  little — physical  insight.  In  that  context,  the  ERD  is  a  principled 
model-dependent  approach  that  has  the  ability  to  select  the  inputs  that  lead 
to  the  best  expected  prediction  error. 

It should also be emphasised that it seeks to optimise the actual criterion
of interest, i.e. generalisation error. Indeed, at the end of the day we are
interested in obtaining the best possible predictions. Reconstructing the
dynamics of a time series, as suggested by the methods aimed at estimating
the embedding dimension, is only a way of reaching this ultimate goal. By
contrast, the ERD method that we present here tries to optimise the
relevant performance criterion directly.

This  has  an  interesting  effect:  by  essence,  the  ERD  method  takes  into 
account  the  fact  that  modelling  is  performed  on  a  limited  amount  of  data. 
On  the  Henon  map  example,  this  leads  to  the  selection  of  an  additional 
delay.  It  has  no  link  to  the  actual  dynamics  of  the  system,  but  gives  a 
clear  decrease  in  error.  Furthermore,  when  the  model  is  not  flexible  enough 
to  implement  the  system  mapping,  we  will  probe  further  into  the  past,  and 
possibly  discover  higher-order  dependencies  that  will  ease  the  modelling.  This 
is  well  illustrated  by  the  two  time  series  examples. 

Another  aspect  of  the  delay  extraction  procedure  as  proposed  here  is  that 
it  relies  on  the  estimation  of  the  generalisation  error.  It  is  expected  that  the 
more  accurate  the  prediction  is,  the  more  relevant  the  delays  selected  will  be. 
It should be noted, however, that we are only interested in finding minima of
the generalisation error, so an estimator will be useful as long as it gives the
right “trend” in generalisation.

Lastly, let us recall that this method is inspired by the forward selection
methods in statistical variable selection. A natural extension of this is the
use of backward elimination steps, in a manner similar to stepwise regression.
Similarly, pruning techniques can be used to remove inputs of the model that
are potentially harmful with respect to generalisation error.


SUMMARY 

We  have  presented  a  generalisation-based  method  for  finding  the  relevant 
delays  in  time  series  modelling.  It  relates  to  stepwise  variable  selection  pro¬ 
cedures  in  classical  (linear)  regression.  This  ‘Extraction  of  Relevant  De¬ 
lays’  method  is  straightforward  to  implement  and  leads  to  interesting  results. 
When the model is not flexible enough to implement the underlying mapping,
it selects additional delays in order to minimise the estimated generalisation
error. Notably, it outperforms some non-parametric methods for
determining the embedding dimension when applied to insufficiently flexible
models.

Directions for future work include refinement of the relevance criterion,
as well as the extension of this scheme to different problems such as system
identification, with more than one temporal input.
Abstract

We show that the decision function of a radial basis function (RBF)
classifier is equivalent in form to the Bayes-optimal discriminant associated
with a special kind of mixture-based statistical model. The
relevant mixture model is a type of “mixture of experts” model for
which class labels, like continuous-valued features, are assumed to have
been generated randomly, conditional on the mixture component of
origin. The new interpretation shows that RBF classifiers do effectively
assume a probability model which, moreover, is easily determined
given the designed RBF. This interpretation also suggests a maximum
likelihood learning objective, as an alternative to standard methods,
for designing the RBF-equivalent models. This statistical objective is
especially useful for incorporating unlabelled data within learning to
enhance performance. While this approach might appear to be limited
to applications involving a large, label-deficient training set, the
scope of application is significantly extended with the observation that
any new data to classify is also unlabelled data, available for learning.
Thus, we suggest a combined learning and use paradigm, to be invoked
whenever there is new data to classify. This new approach is tested
for vowel recognition, given a small archive of examples from different
speakers. For this problem, a conventional method is of necessity
speaker-independent. By contrast, combined learning and use allows
speaker-dependent adaptation, with resulting gains in performance.

*This work was supported in part by National Science Foundation Career Award IRI-9624870.
1 Combined Learning and Use

In  the  problem  of  statistical  classification,  there  is  generally  a  clear  division 
between  the  design  phase  of  the  classifier  and  the  phases  in  which  the  classi¬ 
fier  is  validated  on  test  data,  or  in  which  the  design  is  used  in  an  application 
setting  to  classify  new  data.  Typically,  one  is  given  a  training  set  of  repre¬ 
sentative  (feature  vector,  class  label)  training  pairs.  This  set  is  all  that  is 
available  for  the  model  design,  and  once  the  design  is  complete,  the  classifier 
is  fixed  for  subsequent  test  set  validation  or  for  use  in  an  application. 

It  is  interesting  to  contrast  this  picture  with  learning  in  other  domains, 
e.g.  adaptive  filtering  methods  in  signal  processing,  where  there  is  no  clear 
division  between  learning  and  use  since  both  occur  simultaneously.  A  second 
example  with  analogies  to  classification  is  the  problem  of  image  segmentation. 
Here,  one  has  a  general  model  for  images,  in  the  form  of  an  energy  function 
that  depends  on  both  continuous-valued  model  parameters  and  on  discrete 
segmentation  (classification)  variables.  The  energy  function  is  minimized  for 
each  new  image  one  wishes  to  segment.  Each  such  minimization  determines 
both  the  continuous  model  parameters  and  the  segmentation  for  the  image. 
Thus,  learning  of  the  model  and  its  use  (segmentation  of  the  given  image  at 
hand)  are  combined  into  one  operation. 

There are compelling reasons for considering whether this combined learning
and use analogy can be extended to the classification problem. In particular,
in principle we would like to modify the classifier for new data (data
to be classified) if the training data was an inadequate representation of the
underlying data source, or if the data source is non-stationary. The primary
difficulty  with  this  objective  lies  in  the  fact  that  nearly  all  learning  approaches 
for  statistical  classification  are  pure  supervised  learning  schemes,  for  which 
a  supervising  datum,  i.e.  the  class  label,  is  required  for  each  feature  vector 
used  in  the  training.  This  statement  is  clearly  true  of  neural  network  learning 
approaches  such  as  back  propagation,  which  minimize  the  squared  distance 
to target class values. It is also true of standard methods for training parametric
(e.g. Gaussian mixture) classifiers, wherein the data is first divided
into  classes,  with  maximum  likelihood  estimation  then  applied  to  separately 
estimate  the  class-conditional  densities.  Unfortunately,  new  data  to  classify  is 
inherently  unlabeled,  which  makes  it  incompatible  with  standard  supervised 
learning  approaches.  Moreover,  even  given  the  existence  of  a  learning  method 
which  utilizes  unlabeled  data,  it  is  not  obvious  that  use  of  this  data  would 
be  effective,  without  accompanying  labels,  in  reducing  the  classification  error 
rate.  In  fact,  there  are  results  which  suggest  otherwise  [1]. 

However, recently, several new schemes have been suggested for assimilating
unlabeled data within the learning. These methods have been found to
be effective when the amount of labeled training data is inadequate. In [11], a
method was proposed for training classifiers based on mixed labeled/unlabeled
training sets. The authors suggested a statistical model for the data naturally
suited to a maximum likelihood learning scheme involving both the labeled
and unlabeled data subsets. The unlabeled subset essentially allows improved
estimation of the model parameters, which are then “plugged into” the Bayes
decision rule. In [8] and [7], we built on the previous work in [11], suggesting
improvements to the classifier structure and to the learning method, and
also introducing the concept of “combined learning and use” for classification.
First, we suggested a more powerful probability model than that in [11] with
an associated classifier structure that, in this paper, we show to be equivalent
to the radial basis function (RBF) classifier [9]. Unlike standard RBFs, the
RBF-equivalent probability model is amenable to likelihood-based training,
which is the key to assimilating mixed labeled/unlabeled training data within
the learning. Second, in [8] we proposed an alternative learning criterion to
that in [11], based on the joint data likelihood over both the labeled and unlabeled
data subsets¹, which was found to yield performance gains over the
method in [11]. Two distinct Expectation-Maximization (EM) [3] learning algorithms
were derived for maximizing this likelihood. These distinct methods
were obtained by viewing different data elements as “missing data” within
the EM framework [7]. Finally, we made the observation that test data, or
in fact any new data to classify, can be viewed as a new unlabeled data set
to which the mixed training method can be applied, to modify the classifier
prior to the classifier’s use on this data. This is what we called “combined
learning and use”.

In  this  work,  we  first  briefly  summarize  the  developments  in  [7],  after  which 
we  show  the  equivalence  between  the  probability  model  introduced  there  and 
the  RBF  classifier.  Next,  we  validate  the  combined  learning  and  use  paradigm 
for  this  classifier  structure  through  an  experiment  involving  vowel  recognition 
given  an  archive  of  examples  from  different  speakers.  In  this  context,  given 
a  limited  training  set,  a  conventional  classification  approach  is  of  necessity 
speaker-independent. By contrast, we will show that combined learning and
use provides a way to achieve speaker-dependent adaptation (and use) of the
model,  with  resulting  gains  in  the  classifier  performance.  Finally,  we  sug¬ 
gest  that  the  combined  learning  and  use  approach  may  also  be  applicable  to 
regression  fitting. 

¹ A conditional data likelihood measure was suggested in [11].


2 Mixed Training for an RBF-equivalent Mixture Model


Although the classifier learning problem has been separately cast as one of
i) directly estimating a posteriori class probabilities, ii) least squares regression
to target class values² and iii) more directly minimizing an error count
measure [6], only the first objective appears suitable for incorporating unlabeled
data within the learning. Moreover, while there are neural network
models which directly produce the class probability estimates as outputs³,
the training for such networks is also unsuitable for incorporating unlabeled
data. Alternatively, in [7], we suggested a somewhat unconventional probability
model. This model is a generalization of a standard mixture wherein,
like the feature vectors $x \in \mathcal{R}^n$, the class labels $c \in \mathcal{I}$, $\mathcal{I}$ the label set, are
also assumed to have been generated randomly, conditional on the mixture
component of origin. More concretely, the data is assumed to have been
generated in the following way:


1. Randomly select one of M mixture components according to the probability
mass function $\{\alpha_k\}$.

2. Given the selected component, k, choose: a) a feature vector x according
to the component density $f(x; \Lambda_k)$, where $\Lambda_k$ is the parameter set of the
density, and b) a class label c according to the conditional probabilities
$\{\beta_{c|k}\}$.

Note that usually mixture components are “hard-partitioned” to classes, i.e.
$\beta_{j|k} \in \{0, 1\}$. However, we have found that allowing classes to probabilistically
“share” components makes the learning less sensitive to initialization
and the solution less sensitive to the choice of M. This model will also be
motivated from a different standpoint shortly.
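
The two generation steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the component densities are taken to be spherical Gaussians, and the function name and the parameter layout (lists `alpha`, `means`, `stds`, and a table `beta[k][c]`) are hypothetical.

```python
import random


def sample_labeled_point(alpha, means, stds, beta, rng):
    """Draw one (feature vector, class label, component) triple.

    Step 1: pick component k with probability alpha[k].
    Step 2a: draw x from a spherical Gaussian N(means[k], stds[k]^2 I).
    Step 2b: draw the class label c with probability beta[k][c].
    """
    k = rng.choices(range(len(alpha)), weights=alpha)[0]
    x = [rng.gauss(m, stds[k]) for m in means[k]]
    c = rng.choices(range(len(beta[k])), weights=beta[k])[0]
    return x, c, k


rng = random.Random(0)
alpha = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
stds = [1.0, 1.0]
# "Soft" sharing: each component mostly, but not exclusively, owns one class.
beta = [[0.9, 0.1], [0.2, 0.8]]
x, c, k = sample_labeled_point(alpha, means, stds, beta, rng)
```

Setting every row of `beta` to a 0/1 vector recovers the usual hard partitioning of components to classes.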

The corresponding a posteriori class probabilities take the form:

$$P[c|x] = \sum_k \beta_{c|k} \left( \frac{\alpha_k f(x; \Lambda_k)}{\sum_l \alpha_l f(x; \Lambda_l)} \right). \qquad (1)$$

These probabilities have a “mixture of experts” structure [4], where the “gating
units” are the conditional probabilities of mixture components given feature
vectors (in parentheses), and with the “expert” for component $k$ just the
conditional probability $\{\beta_{c|k}\}$. The associated Bayes decision rule is

$$S_{Bayes}(x) = \arg\max_c \sum_k \beta_{c|k} \left( \frac{\alpha_k f(x; \Lambda_k)}{\sum_l \alpha_l f(x; \Lambda_l)} \right), \qquad (2)$$

where $S_{Bayes}(\cdot)$ is a selector function with range in $\mathcal{I}$.

² This objective is also related to estimating the a posteriori probabilities [10].

³ MLP structures with normalization in the output layer can be trained as probability
estimators by minimizing a cross entropy criterion.
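
Equations (1) and (2) translate directly into code. The sketch below assumes spherical Gaussian component densities, which the text does not impose; all names are hypothetical, and no protection against numerical underflow is included.

```python
import math


def spherical_gaussian(x, mean, std):
    """Density of N(mean, std^2 I) evaluated at x."""
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return math.exp(-0.5 * sq / std ** 2) / ((2.0 * math.pi) ** (d / 2) * std ** d)


def class_posteriors(x, alpha, means, stds, beta):
    """P[c|x] as in (1): gating probabilities times per-component experts."""
    joint = [a * spherical_gaussian(x, m, s) for a, m, s in zip(alpha, means, stds)]
    total = sum(joint)
    gates = [j / total for j in joint]  # P[component k | x]
    n_classes = len(beta[0])
    return [sum(g * beta[k][c] for k, g in enumerate(gates)) for c in range(n_classes)]


def bayes_decision(x, alpha, means, stds, beta):
    """Decision rule (2): the class with the largest posterior probability."""
    post = class_posteriors(x, alpha, means, stds, beta)
    return max(range(len(post)), key=post.__getitem__)


alpha = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
stds = [1.0, 1.0]
beta = [[0.9, 0.1], [0.2, 0.8]]
label = bayes_decision([0.0, 0.0], alpha, means, stds, beta)
```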

For the learning problem, we supposed the existence of two data subsets,
i.e. $\mathcal{X} = \{\mathcal{X}_l, \mathcal{X}_u\}$, where $\mathcal{X}_l = \{(x_1, c_1), (x_2, c_2), \ldots, (x_{N_l}, c_{N_l})\}$ is the labelled
subset, with $\mathcal{X}_u = \{x_{N_l+1}, \ldots, x_N\}$ the unlabeled one. Unlabeled data was
introduced into the learning to improve parameter estimates via the joint
data likelihood criterion:

$$\log L = \sum_{x_i \in \mathcal{X}_u} \log \sum_{l=1}^{M} \alpha_l f(x_i; \Lambda_l) + \sum_{x_i \in \mathcal{X}_l} \log \sum_{l=1}^{M} \alpha_l f(x_i; \Lambda_l) \beta_{c_i|l}. \qquad (3)$$

A description of two distinct EM learning algorithms for maximizing $\log L$
can be found in [7]. Finally, it was recognized that any new data set to
classify can be viewed as a new subset $\mathcal{X}_u$; hence the data to classify can be
used to update the model (e.g. to specialize the model) prior to the model’s
use in classifying the data, i.e. a combined learning and use operation was
suggested.
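
As a hedged illustration of the joint data likelihood criterion, the sketch below evaluates the log-likelihood of a mixed labeled/unlabeled data set: unlabeled points contribute the log of the mixture density, labeled points the log of the mixture density weighted by the class-conditional probabilities. Spherical Gaussian components and all names are assumptions for the example; the EM updates of [7] are not reproduced.

```python
import math


def spherical_gaussian(x, mean, std):
    """Density of N(mean, std^2 I) evaluated at x."""
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return math.exp(-0.5 * sq / std ** 2) / ((2.0 * math.pi) ** (d / 2) * std ** d)


def joint_log_likelihood(labeled, unlabeled, alpha, means, stds, beta):
    """Joint log-likelihood over labeled pairs (x, c) and unlabeled vectors x."""
    total = 0.0
    for x in unlabeled:
        # Unlabeled term: log of the plain mixture density.
        total += math.log(sum(
            a * spherical_gaussian(x, m, s) for a, m, s in zip(alpha, means, stds)))
    for x, c in labeled:
        # Labeled term: each component is also weighted by beta[k][c].
        total += math.log(sum(
            a * spherical_gaussian(x, m, s) * b[c]
            for a, m, s, b in zip(alpha, means, stds, beta)))
    return total


# Hypothetical one-component model with two equally likely classes.
value = joint_log_likelihood(
    labeled=[([0.0], 0)], unlabeled=[[0.0]],
    alpha=[1.0], means=[[0.0]], stds=[1.0], beta=[[0.5, 0.5]])
```

In a combined learning and use setting, the data to classify would simply be passed as the `unlabeled` argument before re-estimating the parameters.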

Note that mixed training and “combined learning and use” only appear to
be applicable to classifier structures that are based on a probability model for
the data. However, we now show that this assumption is not very restrictive,
since the decision rule for the probability model we just described is actually
equivalent in form to a commonly used neural network classifier that is not
typically given a probabilistic description. Consider a radial basis function
network used in a classification setting [9]. For an $|\mathcal{I}|$-class problem, there is
one RBF output per class,

$$g_j(x) = \sum_{k=1}^{M} \lambda_{kj} f(x; \Lambda_k), \qquad j = 1, \ldots, |\mathcal{I}|. \qquad (4)$$

Here, $f(\cdot)$ is the basis function which, without much restriction, we may take
to be a density function. Also, $\lambda_{kj}$ is a scalar weight connecting basis $k$ and
class output $j$.⁴ The associated decision function, $S : \mathcal{R}^n \to \{1, 2, \ldots, |\mathcal{I}|\}$,
is the winner-take-all rule:

$$S(x) = \arg\max_j g_j(x). \qquad (5)$$


⁴ In [9], an alternative normalized RBF structure was suggested. For convenience we
consider the un-normalized structure here, though our equivalence result holds for both
structures because the normalization does not affect the decision rule.
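We can observe that $g_j(\cdot)$ looks like a mixture density. However, it is not
usually thought of in this way, primarily because the weights $\{\lambda_{kj}\}$ can be
positive or negative and are trained for a least squares criterion [9], rather
than a statistical one such as the data likelihood. Regardless, we can easily
show that the decision rule (5) is in fact equivalent to the Bayes rule in (2).
First, define $\lambda_{\min} = \min_{k,j} \lambda_{kj}$. Then, subtracting $\lambda_{\min} \sum_k f(x; \Lambda_k)$ from each
class output, we obtain the equivalent rule:

$$S(x) = \arg\max_j \sum_k f(x; \Lambda_k)\, \tilde{\lambda}_{kj}, \qquad (6)$$

where $\tilde{\lambda}_{kj} = \lambda_{kj} - \lambda_{\min}$. Note that $\tilde{\lambda}_{kj} \geq 0$. Next, we divide each output by
the constant $\sum_{m,n} \tilde{\lambda}_{mn}$ to obtain

$$S(x) = \arg\max_j \sum_k f(x; \Lambda_k)\, q_{kj}, \qquad (7)$$

where $q_{kj} = \tilde{\lambda}_{kj} / \sum_{m,n} \tilde{\lambda}_{mn}$. Now, we have $0 \leq q_{kj} \leq 1$ and $\sum_{k,j} q_{kj} = 1$. Finally, we
can normalize each transformed class output $\tilde{g}_j(x) = \sum_k f(x; \Lambda_k)\, q_{kj}$ by the
sum $\sum_{j=1}^{|\mathcal{I}|} \tilde{g}_j(x)$ to again yield the rule

$$S(x) = \arg\max_j \frac{\sum_k f(x; \Lambda_k)\, q_{kj}}{\sum_{m,n} f(x; \Lambda_m)\, q_{mn}} = \arg\max_j \sum_k \left( \frac{f(x; \Lambda_k) \left( \sum_n q_{kn} \right)}{\sum_m f(x; \Lambda_m) \left( \sum_n q_{mn} \right)} \right) \left( \frac{q_{kj}}{\sum_n q_{kn}} \right). \qquad (8)$$

Now, comparing (2) with (8), it is easy to see that they are equivalent. In
particular, we may identify $q_{kj} / \sum_n q_{kn}$ with $\beta_{j|k}$ and $\sum_n q_{kn}$ with $\alpha_k$. There are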
two  implications  of  this  result.  First,  we  suggest  an  alternative  probabilistic 
viewpoint  on  RBF  classifiers.  In  fact,  given  a  standard  RBF  solution,  we 
have  shown  that  one  can  easily  find  the  equivalent  probability  model.  While 
the  parameters  of  this  equivalent  model  were  not  obtained  via  maximum 
likelihood  estimation,  the  equivalent  model  may  still  provide  some  insight 
into  the  implicit  statistical  modelling  assumptions  made  by  the  RBF  solution. 
Moreover,  in  some  cases  one  is  interested  both  in  hard  classification  decisions 
and  in  a  probabilistic  assessment  of  class  ownership.  The  RBF-equivalent 
model directly provides this information via the probabilities $\{P[c|x]\}$. The
second implication of this equivalence, and the one of more significance for
this work, is that it effectively suggests new training possibilities for a widely
used  neural  network  classifier.  In  particular,  while  the  standard  RBF  model 
is  not  amenable  to  likelihood-based  training  (and  hence  to  combined  learning 
and  use),  the  RBF-equivalent  model  is.  We  next  consider  one  scenario  where 
combined  learning  and  use,  applied  to  this  RBF-equivalent  model,  yields 
performance  benefits.  Other  results  can  be  found  in  [7]. 
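
The construction of the equivalent probability model from a designed RBF can be sketched as follows: shift the weights by their minimum, normalize them, and read off the mixing and class-conditional probabilities. The function names and the weight layout `weights[k][j]` are hypothetical, and the sketch simply rejects the degenerate cases (all weights equal, or a basis whose shifted weights are all zero) rather than handling them.

```python
def rbf_to_mixture(weights):
    """Map RBF weights[k][j] to (alpha, beta) of the equivalent mixture model.

    alpha[k] is the mass of component k; beta[k][j] is the probability of
    class j given component k.
    """
    lam_min = min(min(row) for row in weights)
    shifted = [[w - lam_min for w in row] for row in weights]
    total = sum(sum(row) for row in shifted)
    if total == 0.0:
        raise ValueError("all weights are equal; no equivalent model")
    q = [[w / total for w in row] for row in shifted]
    alpha = [sum(row) for row in q]
    if any(a == 0.0 for a in alpha):
        raise ValueError("a basis function has zero total shifted weight")
    beta = [[qkj / a for qkj in row] for row, a in zip(q, alpha)]
    return alpha, beta


def rbf_decision(weights, basis):
    """Winner-take-all rule on the RBF outputs, given basis activations."""
    n_classes = len(weights[0])
    scores = [sum(f * weights[k][j] for k, f in enumerate(basis)) for j in range(n_classes)]
    return max(range(n_classes), key=scores.__getitem__)


def mixture_decision(alpha, beta, basis):
    """Bayes rule of the mixture model (the common denominator is dropped)."""
    n_classes = len(beta[0])
    scores = [sum(f * alpha[k] * beta[k][j] for k, f in enumerate(basis)) for j in range(n_classes)]
    return max(range(n_classes), key=scores.__getitem__)


# Hypothetical 3-basis, 2-class RBF with some negative weights.
weights = [[1.0, -0.5], [-1.0, 2.0], [0.5, 0.5]]
alpha, beta = rbf_to_mixture(weights)
```

For any non-negative basis activations, `rbf_decision` and `mixture_decision` pick the same class, up to ties.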


3  Experimental  Results 

In  a  basic  application  of  the  combined  learning  and  use  paradigm,  we  sim¬ 
ply  use  the  new  unlabeled  features  associated  with  the  test  set  to  augment 
an  existing  training  set.  After  training  on  the  combined  data  set  using  the 
scheme in [7], the new data set is classified. As one example in [7], we considered
the 3-class, 40-dimensional, 5000-sample waveform-+noise.data set
from the UC Irvine machine learning repository. There, it was shown that
combined  learning  and  use  outperformed  conventional  approaches  for  train¬ 
ing  RBF  classifiers,  which  are  forced  either  to  discard  the  unlabelled  features, 
or  to  make  limited  use  of  them.  Here,  we  apply  this  new  paradigm  to  the 
classification  of  a  speech  archive.  The  archive  includes  examples  of  vowels 
from  different  speakers,  with  the  speaker  identity  known.  While  examples 
of  the  same  vowel  may  share  similar  statistics  across  speakers,  examples  of 
different  vowels  from  the  same  speaker  may  also  share  a  common  statistical 
character.  This  suggests  that  it  may  be  sensible  to  separate  the  data  into 
“speaker-dependent”  batches.  Each  such  batch  to  classify  can  be  taken  as  an 
unlabeled  set  for  combined  learning  and  use,  either  in  concert  with  a  labeled 
training  set  (derived  either  from  the  entire  archive,  or  from  an  independent 
data  set)  or  to  modify  a  previous  design  based  on  such  a  labeled  training  set. 
The  success  of  this  scheme  rests  with  the  potential  for  adapting  the  classifier 
to  each  data  batch,  based  on  the  unlabeled  batch  features. 

We have tested this idea on Deterding’s vowel.dat set, consisting of
LPC-derived log area ratios, representing eleven different vowels. In this set,
there are six examples of each vowel from each of fifteen different speakers
(990 examples in all). We used two examples of each vowel from each speaker as
the labeled training set (330 samples in all), with the remaining four examples
of each vowel from each speaker used as the test set. Note that the data set is
too small to design separate classifiers for each speaker in a conventional
way. We chose to compare two different combined learning approaches, along with
the method from [9] (denoted MD-RBF). In the first scheme (speaker-independent
(SI)), the test set of 660 samples was viewed collectively as $X_u$ and used in
concert with $X_l$ for combined learning. We then classified $X_u$. In the
second method (speaker-dependent (SD)), the forty-four test set samples from
each speaker were viewed as distinct data batches. We thus performed combined
learning separately for each speaker, based on $X_l$ and the speaker-specific
test set. Each batch was then classified based on its speaker-specific model.
Note the tradeoff between the two designs: the speaker-specific scheme designs
each of the fifteen classifiers using only the 330 labeled samples and
44 speaker-specific unlabeled samples, while the speaker-independent scheme
uses the entire data set for its single design. However, this latter classifier
must perform well for all speakers, rather than just a single one. Results were
obtained for models of size 11, 22, and 33 mixture components. The performance
measure chosen was the average test set error fraction, computed based
on different choices for the parameter initialization and for the data subset
realization. We designed classifiers for all possible (fifteen) realizations of
the training and test sets. For each realization and each model size, 20
classifiers were designed based on random parameter initialization, with the
test set performance averaged over the 300 solutions. As shown in Table 1, the
speaker-specific scheme provides a consistent performance advantage over the
speaker-independent one*, with both methods outperforming MD-RBF.
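
The sample counts quoted in this paragraph follow from simple arithmetic, and
the short sketch below reproduces them. Reading the fifteen realizations as the
number of ways to choose which two of the six examples are labeled is an
assumption, adopted only because it matches the stated count; all variable
names are illustrative.

```python
from math import comb

# Figures taken from the description of Deterding's vowel data in the text.
SPEAKERS = 15
VOWELS = 11
EXAMPLES_PER_VOWEL = 6
LABELED_PER_VOWEL = 2   # examples of each vowel per speaker used for training
INITIALIZATIONS = 20    # random parameter initializations per realization

total = SPEAKERS * VOWELS * EXAMPLES_PER_VOWEL        # whole archive
labeled = SPEAKERS * VOWELS * LABELED_PER_VOWEL       # labeled training set
test = total - labeled                                # speaker-independent batch
test_per_speaker = VOWELS * (EXAMPLES_PER_VOWEL - LABELED_PER_VOWEL)

# Assumption: a "realization" fixes which 2 of the 6 examples are labeled.
realizations = comb(EXAMPLES_PER_VOWEL, LABELED_PER_VOWEL)
solutions = realizations * INITIALIZATIONS            # classifiers per model size

print(total, labeled, test, test_per_speaker, realizations, solutions)
# prints: 990 330 660 44 15 300
```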


Model Size       Method    Train-err/Test-err
-------------    ------    ------------------
11 components    SD        0.58/0.58
                 SI        0.61/0.61
                 MD-RBF    0.63/0.64
22 components    SD        0.45/0.46
                 SI        0.49/0.50
                 MD-RBF    0.50/0.52
33 components    SD        0.37/0.38
                 SI        0.41/0.42
                 MD-RBF    0.44/0.44

Table 1: Average test set misclassification error fraction for Deterding’s
vowel.dat set. Results for two “combined learning and use” approaches,
speaker-independent (SI) and speaker-dependent (SD), are shown, along
with MD-RBF.


* The high error rates observed for this experiment are consistent with prior
results reported for this data set.

4  Combined Learning and Regression Modelling

There is a natural way to extend the combined learning and use approach
developed heretofore to address the “other” basic supervised learning problem -
regression. For this problem, the data pairs now take the form $(x_i, y_i)$,
where $y_i$ is the regression output, with $x_i$ now called the input. Combined
learning and regression fitting can be accomplished by modifying existing
learning approaches that are already likelihood-based to incorporate an
additional unlabeled term. Jacobs and Jordan [4] and Jordan et al. [5] suggested
“mixture of experts” regression structures naturally suited for learning
procedures based on maximum likelihood estimation. We can pose the learning
problem as maximizing the conditional likelihood $\log f(y \mid X)$ as in [5],
or the joint likelihood $f(y, X)$ as in [12]. For combined learning and use
application, we suppose that there is now an “unlabeled” new/test input data
set, for which we want to estimate the outputs. Using “l” to denote the
“labeled” subset and “u” the “unlabeled” one, we can now alternatively maximize
either $\log f(y_l \mid X_l) + \log f(X_u)$ or
$\log f(y_l, X_l) + \log f(X_u)$. We are currently investigating combined
learning and regression as outlined here. It is important to note that combined
learning and use may have a smaller range of application for regression than
for classification. In particular, for certain regression problems such as time
series prediction, the actual outputs are made available as time unfolds; hence,
this problem lacks a large, output-deficient set on which to apply combined
learning and use. One likely application, however, is the problem of restoring
images observed through a noisy, distorting medium. Cha and Kassam [2]
used a model and learning approach similar to that in [4] and [5], with the
learning performed over a training set of distorted images. We believe that
combined learning and regression may be effective in this setting.

Abstract - We address the problem of autonomous decision making in
classification of radioastronomy transient signals on spectrograms from
spacecraft. It is known [10] that the assessment of the decision process
can be divided into acceptance of the classification, instant rejection of
the current signal classification, or rejection of the entire classifier model.
We propose to combine prediction and classification with a double
architecture of neural networks to optimize a decision while minimizing the
false alarm risk. We suggest a method to derive the input and output
windows for the predictor network. Results on real data from the URAP
experiment aboard the Ulysses spacecraft show that this scheme is tractable
and effective. Keywords: spectrogram classification, radioastronomy,
neural network classification.


1  INTRODUCTION 

Transient astronomical signal detection, classification, and tracking from
space is becoming a classic application for neural networks [4]. In the
literature, however, the risk of a classification or detection decision is not
always taken into account as it is for radar or sonar. Classification risk
minimization has been studied by Bishop [2] and Schurmann [6], who suggest a
Bayesian analysis of confidence, while Lippmann [7] uses a two-network approach
based on bootstrapping to predict the risk of classification. However, low
frequency planetary radio emissions are nonstationary, non-Gaussian, and
nonlinear. For such signals, classification risk assessment is not obvious.

A solution has recently been suggested by de Lassus et al. [5] with two
neural networks working in parallel, although only smooth signals were
considered. We propose here a more general approach that enables false alarm
risk reduction whatever the time frequency distribution of the signal, provided
that the signal is correlated on the time frequency plane to some extent. The
solution we suggest is based on neural classification of the signal confirmed
by a neural prediction. With this method, it is possible to reject the
classifier’s choice for an individual sample or even, if necessary, to decide
that the current classifier has become unreliable and has to be changed. The
example we show belongs to low frequency radio astronomy, but this approach
seems applicable in many other fields where geophysical signals are under study.


2  CLASSIFICATION  OF  SPECTROGRAMS 

Classification of low frequency planetary radio bursts displayed in the time
frequency plane is a task similar to the well known “cocktail party” problem:
identify many sources emitting at the same time. The signals under study
originate from different parts of the solar system, and the distances, as well
as the observation angles, are heterogeneous and change fast. This leads
to moving patterns on the spectrogram while we want to classify the signal.
Automatic tracking and classification of non-Gaussian, nonlinear transient
planetary radio signals on a spectrogram using a single detector is a difficult
task. When the omnidirectional detector is moving fast across the sources
and when the patterns have an anisotropic distribution along the trajectory
of the detector, abrupt changes occur on the spectrogram. An accurate
classification may be cast into simpler subtasks: identify the number of
sources, find the main features of the sources, and classify the signal. If the
classification has to be done on the spectrogram representation of the signal,
the Time Delay Neural Network (TDNN) architecture has proved to be a convenient
choice [4]. A classifier would typically have a 20 x 12 input, a 1 x 3 output,
and 1240 parameters for 9609 learning samples. On each frequency channel, a
TDNN classifier is moved from one energy peak of the signal to another. An
input window is cut on the spectrogram around the peak. The TDNN output
gives the label (class of signal) of the current energy peak on the spectrogram.
Typically, the method yields a success rate of up to 90%, rivaling visual
recognition by a human expert. However, since the number of sources may change,
as well as their patterns, the classifier may be confronted with situations not
seen before. In such cases, there is a need to assess the quality of the
decision, and to evaluate the ability of the classifier to carry on its mission
successfully.


The rejection dilemma. Let $d_1$ be the decision of accepting a classifier’s
suggestion, and $d_0$ the rejection of this proposal. Let $H_1$ be the
hypothesis that the signal belongs to the label suggested by the classifier,
and $H_0$ the hypothesis that it does not belong to this label. Then, we have
[10]: choose $d_1$ when $H_1$ (correct classification); choose $d_1$ when $H_0$
(false alarm); choose $d_0$ when $H_1$ (undue rejection); choose $d_0$ when
$H_0$ (justified rejection). If the number of TDNN outputs is strictly designed
according to the number of known classes of signal, the risk of $d_0H_0$ and
$d_0H_1$ is limited to the set of outputs of the neural network, and the
classifier is forced to decide between the available classes [8]. This is a
good solution when the number of classes is computable by direction finding
together with clustering on the cumulant domain, or on a wavelet packet base
[3]. But even so, we still have to deal with $d_1H_0$. In order to reduce the
risk of $d_1H_0$ without losing too much on $d_1H_1$, one would intuitively try
to use different thresholds on the output of the classifier, but this is
inappropriate if the probability densities of each signal class in the learning
data base are equal. The solution that we suggest is rejection by prediction:
that is, to predict the signal outside of the TDNN classifier input. If this
short term prediction is confirmed by future samples, then the postponed
decision is confirmed.
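
The four decision/hypothesis combinations listed above can be written as a
small lookup. The sketch below is only an illustration of that bookkeeping;
the function names and the boolean encoding (accepted for $d_1$, label correct
for $H_1$) are assumptions made here and are not part of the original method.

```python
def outcome(accepted, label_is_correct):
    """Name the outcome for a decision (d1 = accepted) under H1 or H0."""
    if accepted:
        # d1: the classifier's suggestion is accepted.
        return "correct classification" if label_is_correct else "false alarm"
    # d0: the classifier's suggestion is rejected.
    return "undue rejection" if label_is_correct else "justified rejection"


def false_alarm_rate(decisions):
    """Fraction of (accepted, label_is_correct) pairs that fall in d1 H0."""
    if not decisions:
        return 0.0
    alarms = sum(1 for a, c in decisions if outcome(a, c) == "false alarm")
    return alarms / len(decisions)
```

For example, a batch containing one instance of each of the four cases has a
false alarm rate of one quarter.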


3  REJECTION  BY  PREDICTION 

Principles. In order to minimize the classification risk $d_1H_0$, we need a
measure of quality to motivate our decision. We suggest using two neural
networks (NN) in parallel, a classifier and a predictor, for each frequency
channel. Inputs to these NNs will be subsets of the same window cut on the
spectrogram. One NN will classify the signal with the information available
from the window. The other NN will predict the signal expected to come
just outside the window in the very near future. When the actual signal
arrives, it is compared with the prediction. The distance from the prediction
to the real signal yields the measure of quality we need to confirm
the decision of the classifier. In order to cover a sufficient area of the time
frequency plane, we decided to predict the future signal at five different
locations (time frequency coefficients). The number five is not fundamental
and can be modified. We now show how positive autocorrelation lags yield
the customized architecture we need for the predictor NN.


Designing a predictor. The problem is to choose the right window and
the suitable time step for the prediction. It is known [1] that the optimal
input for our classifier is determined in the ambiguity plane by the first zero
crossings of the 2D autocorrelation function of the signal spectrogram. The
input window of a classifier TDNN (denoted $I_c$ in the following) has been
optimized in order to include all types of signal (wide band smooth, wide band
bursty, and band limited signals, denoted $S_1$, $S_2$, $S_3$ respectively). We
want, for each class $S_i$, $i = 1 \ldots 3$, to determine a smaller window
($I_p$) included in the $I_c$ window, and five points $O_i$, $i = 1 \ldots 5$,
outside $I_p$ where prediction will be

Figure 3: Typical Ulysses spectrogram with 4 different types of signals
overlapping.

Data Analysis. For our experiment with real data, we used eight months
of data obtained by the URAP experiment aboard the Ulysses spacecraft [9]
(September 1991 - April 1992). In these data, radio bursts from the Sun and
Jupiter are continuously present. Since Ulysses flew by Jupiter on February 8,
1992 and changed its trajectory plane through a Jupiter gravity assist, the
morphology of the highly directive radio emissions from Jupiter changed
abruptly. Radio spectra were acquired every 144 sec; they are made of 16
frequency channels logarithmically spaced from 10 to 1000 kHz. Signal
intensities were normalized to the interval (-1, +1). From September to
February, four kinds of radio emissions are present: two smooth signals (Solar
Type III, Jovian nKOM) and two bursty signals (Jovian hOM and bKOM). Then, from
February 8, 1992 until April, the two bursty signals (Jovian hOM and bKOM)
disappear abruptly and a new bursty signal is present (Kom). The two smooth
signals are present throughout the entire data set.


Experiment. We selected a learning set of 9609 events from September to
December 1991, and a test set of 486 events collected during these months,
but not on the same days. The test set was used to stop learning early enough
to preserve the generalization properties of the NNs. A validation set of 951
events was then selected from January to February 8, 1992. We tried our
method on the 100 kHz frequency channel, which lies at the center of the
spectrogram, because there the task was most difficult, with four overlapping
signals. In fact, any other frequency could have been chosen. For the Type
III signal, the constructive algorithm gave a 9 x 2 window and five points in
the neighborhood of the window. So we derived an MLP with 9 x 2 inputs,
5 outputs, and 222 parameters. We predicted the signal at the five chosen
locations where autocorrelation was maximum outside the input window. In
the last layer of the multilayer perceptron, we omitted the sigmoidal mapping,
so that garbage detection criteria can easily be derived from observing the
values of the outputs [6]. A threshold is calculated from statistics ($\mu$
and $\sigma$) of the prediction error on the learning data set. A classified
sample is rejected when its prediction error is higher than the threshold.
Similarly, we derived a predictor MLP for the nKOM signal with a 7 x 2 input
window and 5 outputs. For the bKOM signal the best window was very small,
2 x 2, so we derived an MLP with a 2 x 2 input, 5 outputs, and 30 parameters.

Figure 4: Percentage of $d_1H_0$ risk reduction (X axis) against percentage of
$d_1H_1$ loss (Y axis). (a) Type III, (b) bKOM.
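
The rejection rule described in the Experiment paragraph (a threshold built
from the mean and standard deviation of the learning-set prediction error) can
be sketched as follows. The text does not state how the two statistics are
combined or which distance is used, so the form `mu + k * sigma`, the default
`k = 2.0`, and the Euclidean distance over the five predicted points are all
assumptions made for illustration; the function names are hypothetical.

```python
import math
import statistics


def prediction_error(predicted, observed):
    """Euclidean distance between predicted and observed points (assumed metric)."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)))


def rejection_threshold(learning_errors, k=2.0):
    """Threshold from learning-set error statistics; mu + k*sigma is an assumption."""
    mu = statistics.mean(learning_errors)
    sigma = statistics.pstdev(learning_errors)
    return mu + k * sigma


def accept(predicted, observed, threshold):
    """Keep the classifier's label unless the prediction error exceeds the threshold."""
    return prediction_error(predicted, observed) <= threshold
```

With learning errors of 1, 1, 3 and 3, the mean is 2 and the standard deviation
is 1, so the assumed rule gives a threshold of 4; a sample whose five-point
prediction misses by a distance of 5 would then be rejected.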

Results and Discussion. The two plots in Figure 4 show the predictor
operating characteristic curve for two different types of signals, plotting
$d_1H_0$ risk percentage against $d_1H_1$ risk percentage. The nearer the curve
approaches the upper left corner of the figure, the better the results.
Performance can be computed by integration of this curve. In Figure 4(b),
the first four curves of the bKOM list are plotted. Note that the
constructive algorithm gives us the windows in an ordered list of decreasing
efficiency. The lower performance on nKOM (not shown in the figures) has
an explanation: the nKOM signal is seldom seen, and the training database does
not contain as many nKOM examples as needed to reach the same standard as for
the other signals. These results indicate that pointwise rejection of $d_1H_0$
is possible provided that the input window is optimized for the signal under
study. Moreover, they demonstrate that peak correlation is a key factor for
prediction. We will now see another interesting result: the possibility of
model rejection when a new signal is received.
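
The remark that performance can be computed by integrating the operating
characteristic curve can be illustrated with a trapezoidal rule. This is a
minimal sketch: the choice of the trapezoidal rule, the representation of the
curve as (x, y) pairs, and the function name are assumptions, not details
given in the text.

```python
def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) pairs."""
    pts = sorted(points)  # integrate from left to right along the X axis
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

With both axes scaled to the unit interval, a diagonal curve from (0, 0) to
(1, 1) integrates to 0.5, while a curve that reaches the upper left corner
first integrates to 1.0.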


4  MODEL  REJECTION 

Principles. The idea is to use the MLP predictors and the TDNN classifier
jointly, to assess the ability of the classifier to pursue its task. This means
that if the signals present in the learning set have disappeared and have
been replaced by new signals, learning has to be resumed and the classifier
changed. A constrained classifier alone gives no information on its ability
to carry on its task. If the successive prediction errors of the current MLP
predictor are kept in memory, then their standard deviation, calculated on a
moving window, gives information on the presence of a waveform that was
not present in the learning data set. A suitable error bar calculated from the
learning data will tell if the classifier has to be changed.

made. Theoretical results [1] imply that the size of $I_p$ has to be different
for each class $S_i$, depending upon the stationarity of the signal measured by
its autocorrelation. Figures 1(a) and 1(b) show the different autocorrelations
for two types of signal: a smooth signal (solar Type III), and a bursty signal
(Jovian bKOM). Classification and prediction will only be possible where
isolines show positive autocorrelation, i.e. on dark areas, which vary from
one signal to another. The classifier’s input can be centered on the energy
peak, the size of the window being adjusted to the mean zero crossing. This
choice is not possible for the predictor because some room has to be left for
the predicted output somewhere on the highest part of the autocorrelation
function. This leads to an iterative optimization process to choose the best
compromise between input and output settings and sizes.


Figure 1: Time frequency autocorrelation of two signals. Dark areas show strong
correlation with the signal energy peak. (a) Type III, (b) bKOM.

Specification for the windows. Considering the type of signals we were
studying, we decided to predict their behavior at five different locations
chosen outside of the input window where autocorrelation is at its highest
level. Doing so, we can be assured that prediction is done precisely where the
signal is locally quasi-stationary, so that our chance of success is maximum.
The input window must be rectangular to be suitable for the TDNN classifier.
According to our experiments, it seems that TDNNs are more efficient than
MLPs for classification, while the latter are better for prediction. We also
found that the MLP is somewhat more sensitive to noise than the TDNN when
predicting. Nevertheless, we assigned a multilayer perceptron (MLP) to the
prediction task. As a specific predictor NN has to be derived for each type of
signal, we suggest a general method to choose the input and output layers of
the predictor NN.


Derivation of window size and location. The general idea is to adjust
the input layer as close as possible to the unknown matched filter. Set a
small candidate input window ($I_p$) (e.g. of size 2 x 2) for the predictor
over the energy peak at a given frequency. Compute $C_k$ (correlation between
the window and the peak for signal $S_k$):

$$C_k = \cdots \sum_{j=1}^{N} \sum_{i=1}^{M} \cdots$$

where $\sigma$ is the mean standard deviation of $I_p$ windows, $N$ is the
number of samples in the learning database of the same signal $S_k$, $M$ is the
size of the window, $x_j$ is the peak, and [...] is the mean energy of the
window around peak $x_j$. Try all possible positions for this window around the
peak. Enlarge the window and iterate the process as long as $C_k$ is positive.
Select the window which maximizes $C_k$ computed for the class $k$ over the
learning database, and then determine the related location of the desired
prediction (points $O_i$, $i = 1 \ldots 5$). Choose these points outside of the
input window, where correlation is highest. The output of the predictor NN will
have to predict these points (one output per point to predict). The result can
be seen in Figure 2, for two types of signal. It gives a mapping of the
autocorrelation of all possible windows from size 1 x 1 to size 32 x 16. The
areas in grey levels show efficient window sizes. The best window is given by
the darkest point of the figure. The Y size is read vertically, the X size
horizontally. For smooth wide band signals (Solar Type III), good windows are
2500 seconds long and 2 frequency channels high. For bursty wide band signals
(Jovian bKOM), efficient windows can be found in a small area around 300
seconds long and two frequency channels high. For band limited signals (Jovian
nKOM, not represented here), relevant windows can be found along a narrow
band two frequency channels high and 2000 seconds long. With our data base,
convergence was almost immediate.

The next step is to learn the prediction, test the predictor with a test set,
and choose a rejection error bar according to the performance on the test
set. Finally, the system is validated on a validation set.


Figure 2: Mapping of $I_p$ window autocorrelation according to its size. Window
time width on X axis, frequency height on Y axis. Dark areas show strong
correlations. (a) Type III, (b) bKOM.

Experiment. For this experiment we chose the data from September 1991
to April 1992. On February 8, the bKOM/hOM disappears and is replaced by Kom,
and from this date the classifier is unable to operate. This is obvious from
the peak starting in February in Figure 6, which shows the classification error
calculated for the period. Figure 5 shows the prediction error during the same
period. The appearance of the new signal is accurately detected within a few
minutes. Spurious peaks after February indicate clearly that the classifier is
obsolete. Detailed analysis of the results shows that the valleys in the
prediction error are correlated with those of the classification error. They
correspond to the presence of recognized signals (Type III). These signals are
jammed by a new Kom signal never seen in the learning data base.

Figure 5: Prediction error. Figure 6: Classification error.
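
The model-rejection principle of this section (the standard deviation of the
stored prediction errors, computed on a moving window and compared with an
error bar) can be sketched as below. The window length, the error bar value and
the function name are hypothetical; the text only says that the error bar is
derived from the learning data.

```python
import statistics


def model_rejection_flags(errors, window, error_bar):
    """For each full moving window, flag whether the error spread exceeds the bar."""
    flags = []
    for end in range(window, len(errors) + 1):
        spread = statistics.pstdev(errors[end - window:end])
        flags.append(spread > error_bar)
    return flags
```

A steady run of small prediction errors keeps every flag false. Once errors
from an unseen waveform enter the window, the spread jumps and the flag turns
true, suggesting that the classifier should be retrained or replaced.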


5  CONCLUSION 

We have proposed a method to manage rejection in the classification of
spectrograms of low frequency radio astronomy signals. We have shown that a
predictor MLP and a classifier TDNN can be used in parallel to minimize the
false alarm risk. Moreover, this scheme can be used to decide if the classifier
has to be changed when the environment has evolved. This capacity is of
crucial importance for space radioastronomy detectors which have no
directivity. We have shown how to derive the best input and output windows for
the predictor neural network. This method gives us a series of windows of
decreasing efficiency and highlights the fact that ambiguity plane correlation
is a key factor for transient Bayesian prediction.

Abstract

In this paper, we propose some theories regarding the dynamical system
representational capabilities of recurrent neural networks with real-valued
inputs and outputs. It is shown that multiple nonlinear dynamic systems
can be approximated within a single nonlinear model structure. A relationship
is identified between this class of recurrent network, hybrid models and
agent based systems.


1  Introduction 

Recurrent neural networks are very general models and have been proven to
offer significant computational capabilities such as Turing equivalence with
linear slowdown [18]. There has been some interest in applying recurrent
networks to dynamical systems and control problems. It has been shown that such
models possess universal dynamic approximation capabilities [9, 19].

Recently, it has been shown by Feldkamp and Puskorius [8] that a class of
recurrent networks can learn to model several dynamical systems and switch
between them, depending on the characteristics of the input signal. Instead of
learning to model just one system, the network was able to learn several models
with very different properties.

This type of phenomenon is not unique to the above example, however. There
appear to be a number of interrelated methods in the literature which have been
studied mostly independently; see for example [1, 2, 4, 10-12, 16, 17].

In this paper, we examine this phenomenon further, and propose a theory which
explains how recurrent neural networks can possess the capability of modelling
a number of dynamical systems simultaneously. Examples are also given.

2  Preliminaries

Function approximation is the task of approximating a mapping given by the
function $F_1 : \mathcal{R}^n \to \mathcal{R}^m$, where $\mathcal{R}$ is the
usual Euclidean space. Functional approximation is the task of approximating a
mapping $F_2 : C(K) \to \mathcal{R}$, where $C(K)$ is a Banach space of
continuous functions on a compact set $K$, defined by the norm
$\|f\|_{C(K)} = \max_{x \in K} |f(x)|$. The implication here is that we have a
mapping $F_2$ which maps the past inputs¹ to the current output (i.e. a
variable in $\mathcal{R}$) [3]. Operator approximation is the task of
approximating a mapping $F_3 : C(K) \to C(K)$. Thus, in this case, we seek to
approximate a mapping of an input sequence $x(t) \in C(K)$ to an output
sequence $y(t) \in C(K)$.

We term $C_o(K_o)$ the space of operators on a compact set $K_o$. The mapping
from one operator space $C_o(K_o)$ to another operator space $C_o(K_o)$ is
called, for lack of a better name, an operational map.


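Definition 1 An operational map $H$ is defined by

$$H : C_o(K_o) \to C_o(K_o) \qquad (1)$$

Definition 2 A recurrent network (RNN) is defined by

$$\mathbf{x}(t+1) = \mathbf{f}(A\mathbf{x}(t) + B\mathbf{u}(t) + E\mathbf{y}(t)) \qquad (2)$$

$$\mathbf{y}(t) = C\mathbf{x}(t) + D\mathbf{u}(t) \qquad (3)$$

where $\mathbf{x}(t) = [x(t)\ x(t-1) \ldots x(t-n_x)]^T$ is an
$n(n_x+1) \times 1$ vector,
$\mathbf{u}(t) = [u(t)\ u(t-1) \ldots u(t-n_u)]^T$,
$\mathbf{y}(t-1) = [y(t-1) \ldots y(t-n_y)]^T$, $\mathbf{f}$ is a vector-valued
function with sigmoid elements,
$A \in \mathcal{R}^{n(n_x+1) \times n(n_x+1)}$,
$B \in \mathcal{R}^{n(n_x+1) \times m(n_u+1)}$,
$C \in \mathcal{R}^{pn_y \times n(n_x+1)}$,
$D \in \mathcal{R}^{pn_y \times m(n_u+1)}$
and $E \in \mathcal{R}^{n(n_x+1) \times pn_y}$. The bias terms are not
explicitly shown, but are included within $\mathbf{u}(t)$ as a fixed-value
input.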

This  characterization  of  a  recurrent  network  gives  a  general  framework  from 
which  many  well  known  structures  can  be  derived,  e.g.  [18-20]. 
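
One update of the recurrent network in Definition 2 can be sketched in plain
Python. This is an illustration only: the state, input and output are treated
as ordinary vectors (the tapped-delay structure of the definition is not
modelled), `tanh` stands in for the unspecified sigmoid, and the helper names
are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def add(v, w):
    """Elementwise sum of two equal-length vectors."""
    return [a + b for a, b in zip(v, w)]


def rnn_step(A, B, C, D, E, x, u):
    """Return (x(t+1), y(t)) from the state x(t) and input u(t)."""
    y = add(matvec(C, x), matvec(D, u))                        # y(t) = C x(t) + D u(t)
    pre = add(add(matvec(A, x), matvec(B, u)), matvec(E, y))   # A x + B u + E y
    x_next = [math.tanh(p) for p in pre]                       # assumed sigmoid: tanh
    return x_next, y
```

Setting `E` to zero removes the output feedback, and setting `D` to zero removes
the direct feedthrough, which is one way to see how more familiar recurrent
structures arise as special cases of this form.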

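Definition 3 A multiple model $\mathcal{M}_i(S_i, \theta_i)$ with fixed
structure $S_i$ and parameter vector $\theta_i$ exhibits a set of
characteristic properties $V_i$, where

$$V_i = \begin{cases} V_i, \quad i = 1, 2, \ldots & \text{discrete multiple model} \\ V_\nu, \quad \nu \in \mathcal{R} & \text{continuous multiple model} \end{cases} \qquad (4)$$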


3  Multiple Model Representation Using Operational Maps

A multiple model G can be characterized as an operator which itself consists
of multiple operators $F$. Consider a multiple model operator $A : x \to y$,
$x \in X \subset C(K)$, $y \in Y \subset C(K)$ describing the input-output
functional relationship. Additionally, $A$ contains an operator $F$ subject to,
for example, $E : F_a \to F_b$, $F_a \in \mathcal{S}_a \subset C_o(K_o)$,
$F_b \in \mathcal{S}_b \subset C_o(K_o)$. Now, instead of considering $A$
directly, we are interested in the existence of, and mechanisms by which there
may arise, mappings of the form $H : F_i \to F_{i+1}$, where $i = 1, 2, \ldots$
is the index of the discrete functionals in the discrete multiple model case,
or $H : F_a \to F_b$ corresponding to the continuous multiple model case. In
each case, $H \subset C_\Omega(K_o)$, where $C_\Omega(K_o)$ is the space of
operational maps.

¹ A sequence $x(t)$, which we may sample at discrete points $t = 0, 1, \ldots$,
is given by $[x(0), x(1), \ldots]^T$ and is a function in $C(K)$.

Hence, in addressing the issue of multiple models, we seek to answer the following question: 'Is it possible to find a nonlinear model which approximates an operational map $H : C_o(K_o) \to C_o(K_o)$?' We consider this question below.

The  existence  of  a  universal  general  operational  map  is  made  clear  by  the 
following  theorem. 

Theorem 1 An operational map H given by

$$F_\theta, x(t) \mapsto H(F) : x(t) \to y(t) \quad (5)$$

can be obtained by the interconnection of a parameterized operator $F_\theta : C(K) \to C(K)$, and functional $M_c : C(K) \to \mathcal{R}^m$ according to

$$y = F_\theta(x; \theta) \quad (6)$$
$$\theta = M_c(x), \qquad \theta = \theta_0, \; t < 0 \quad (7)$$

where $x(t), y(t) \in C(K)$ are continuous, real-valued input and output functions respectively of $t \in \mathcal{R}^+$, $\theta$ is the m-dimensional parameter vector of $F_\theta$ and $\theta_0$ is the initial parameter vector.

Proof Sketch. Let there exist an operator $F_\theta$ determined by the parameter vector $\theta$ and a functional map $M_c$ capable of universal approximation in the sense of, for example, [15]. For the $i$th parameter within a given model $F_\theta$, we have

$$s_i = \theta_i z_i \quad (8)$$

Now, let $M_c$ be a functional. Therefore, for any input sequence $x \in X$, $M_c$ results in any desired set of parameters $\Theta \subset \mathcal{R}^m$, such that

$$s_i = M_{ci}(x) z_i \quad (9)$$
$$\phantom{s_i} = \theta_i z_i \quad (10)$$

where $M_c = [M_{c1} \cdots M_{cm}]'$ is a vector function. Thus, the existence of $M_c : X \to \Theta$ permits any arbitrary $F_\theta$ to be obtained due to the mapping $M_c$. The operational map $H = \{F_\theta, M_c\}$ is general in the sense that it is capable of the same approximation as $F_\theta$, but can be varied arbitrarily for any sequence x.


4 Universal Operational Maps

From  Theorem  1,  we  can  obtain  the  following  result. 

Theorem 2 (Universal Operational Map-I) There exists a parameterized model $G(\theta) : x(t) \to \hat{y}(t)$, where $x \in X \subset C_o(K_o)$ and $y \in Y \subset C_o(K_o)$, such that

$$|\hat{y}(t) - y(t)| < \epsilon \quad (11)$$

where $F_\theta, x(t) \mapsto H(F) : x(t) \to y(t)$ and $\epsilon > 0$.

Proof Sketch. Let $F_\theta$ be a parameterized model capable of universally approximating any operator (e.g. a Chen network [7]). Let $M_c$ be a time delay neural network or other structure having universal functional approximation characteristics [6, 15]. Then from Theorem 1, there exists a neural network G defined collectively by $F_\theta$ and $M_c$, which can approximate, arbitrarily closely, some operational map $H = \{F_\theta, M_c\}$, where $H : C_o(K_o) \to C_o(K_o)$.
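
The interconnection used in the proof sketch can be illustrated with a small example. Here a hypothetical FIR operator stands in for the parameterized model, and a one-hidden-layer network over a tapped delay line stands in for the parameter-generating functional. Both stand-ins, and the names `M_c` and `operational_map`, are assumptions made for illustration; the theorem only requires that each part be a universal approximator.

```python
import numpy as np

def M_c(window, W1, b1, W2, b2):
    """Functional mapping a window of recent inputs to a parameter vector theta."""
    return W2 @ np.tanh(W1 @ window + b1) + b2

def operational_map(x, W1, b1, W2, b2, taps=3):
    """Compute y(t) = F_theta(x; theta(t)), with theta(t) = M_c(x) re-evaluated each step."""
    x = np.asarray(x, dtype=float)
    y = np.zeros(len(x))
    xp = np.concatenate([np.zeros(taps - 1), x])   # zero initial conditions
    for t in range(len(x)):
        window = xp[t:t + taps][::-1]              # [x(t), x(t-1), ..., x(t-taps+1)]
        theta = M_c(window, W1, b1, W2, b2)        # one parameter per tap
        y[t] = theta @ window                      # sum over i of theta_i * z_i
    return y
```

When the output weights `W2` are zero, the parameters are constant and the model collapses to an ordinary fixed FIR filter; non-zero `W2` lets the effective filter change with the input sequence.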

Remarks 

1. The implication of the theorem is that every weight in F is replaced by an additional network $M_{ci}$. This provides the means of approximating nonlinear operational maps. As noted earlier, related approaches have been considered in the literature (see, for example, [16,17]).

2. It is possible to introduce any required type of model for $M_c$. Thus, a hybrid model can be elegantly obtained.

3. The model G is a sigma-pi network, but can also be interpreted as a modular structure.

4. A more general form for $M_c$ is $s = M_{ci}(x, u)$. In this case, the output from $M_c$ is not used as a parameter, but receives the previous parameter input u and gives the previous output s after the parameter. Hence, we can derive the following related theorem.

Theorem 3 (Universal Operational Map-II) A universal operational mapping $F_\theta, x(t) \mapsto H(F) : x(t) \to y(t)$ is given by the interconnection of a universal operator $F : (X, V) \to Y$ and a single-input single-output universal function map $M_{c1} : X \to V$ according to

$$y = F(x, v; \theta) \quad (12)$$
$$v = M_{c1}(x) \quad (13)$$

where $x \in X \subset C(K)$, $v \in V \subset C(K)$, $y \in Y \subset C(K)$, and $\theta \in \Theta \subset \mathcal{R}^m$.

Proof Sketch. The proof follows directly from Theorem 1. The operator $F(x, v)$ is a universal approximator as independent as required in each of its inputs. Therefore, for every distinct value of v, e.g. $v = 1, 2, \ldots$, a distinct universal approximation $F : C(K) \to C(K)$ may be obtained. Let $M_{c1}$ prescribe some operator from the input x to the extra input v. Therefore, a distinct universal approximation F can be obtained as required for any given input x, hence a universal operational model is obtained.

Remarks


1. In the above theorems for universal operational maps, it may be assumed that F is given by a network as described in [7].

2. F may also be given by a recurrent neural network, with the universal approximation properties in the sense of those shown by Sontag [19]. In this case, the resulting recurrent network H also possesses multiple modelling capabilities².

These  results  offer  a  different  viewpoint  for  modular  networks.  Previously, 
modular  neural  networks  were  used  to  approximate  a  mapping  by  spatial  decom¬ 
position  [13].  Here,  operational  maps  perform  a  temporal  decomposition. 

The discrete multiple model can be interpreted as a hybrid system [5]. Moreover, a neural network may arbitrarily form such hybrid systems, even in the course of training, without the user necessarily being aware of this phenomenon occurring.

Discrete multiple models are related to agent systems [2,14]. The models F and $M_c$ can each be viewed as agents cooperating to produce a more complex mapping than either is capable of producing alone. The framework proposed here also includes models such as mixtures of experts [10].


5  Synthesis  of  Multiple  Models  by  Bias  Shifting 

Here  we  present  a  simple  constructive  approach  to  show  how  multiple  models  may 
be  synthesized  in  both  feedforward  and  recurrent  neural  networks.  The  approach 
we  propose  is  well  suited  for  multiple  models  which  are  comprised  of  a  set  of 
discrete  F  models  and  is  applicable  to  synthesis  as  well  as  analysis,  though  the 
latter  case  is  not  discussed  here. 


Theorem 4 An MLP model $G(x, v)$ can form multiple unique function mappings defined by

$$G(x, v) = m_v(x), \quad v = 1, \ldots, p \quad (14)$$

$$m_v(x) = \sum_{i=1}^{N} c_{vi}\, g\!\left(\sum_{j=1}^{m} w_{vij} x_j + \theta_{vi}\right) \quad (15)$$

where the extra input v indexes the desired mapping $m_v(x)$.

2 Though  to  be  precise,  one  may  wish  to  qualify  the  sense  of  the  approximation  in  terms  of 
the  specific  characteristics  of  universal  approximation  being  performed,  i.e.  in  the  sense  defined 
by Chen and Chen [7] or Sontag [19].

Proof Sketch. Without loss of generality, let N = np. This specifies p subgroups within G, each of n hidden units. Therefore we have

N  /  m  \ 

G(x,v)  =  +  +  h  =  l,...p  (16) 

9hi  =  'pM-rh  (17) 

For sufficiently large r, if we consider approximations on the range [a, b] where r, then setting v = {1, ..., p} results in

N  /  m  \ 

G{X,V)  =  +  j  (1^) 

=  mv(x)  (19) 


as  required. 

Theorem 5 An RNN model $G(x, y, v)$ can form multiple unique mappings

$$G(x, y, v) = m_v(x, y), \quad v = 1, \ldots, p \quad (20)$$

where the extra input v indexes the desired mapping $m_v(x, y)$.

Proof Sketch. Omitted due to lack of space. The proof follows a similar procedure to that used for Theorem 4.

The implication of this theorem is that by appropriate bias offsets in the different groups of units, various units in the recurrent network can be "pushed" in and out of action. Note that this method can be used as a means of switching between different discrete models, or in a continuous sense, by setting r to an appropriately small value.
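
The idea of pushing groups of units in and out of action can be sketched for the feedforward case as follows. This is one possible construction and not necessarily the one used in the proofs: it assumes logistic hidden units and presents the index v to the network as a one-hot vector, so that every non-selected group receives a bias offset of -r and saturates near zero. The names `sub_mlp` and `gated_mlp` are hypothetical.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def sub_mlp(x, W, b, c):
    """m_v(x): a single hidden group evaluated on its own."""
    return c @ logistic(W @ x + b)

def gated_mlp(x, v, groups, r=50.0):
    """One MLP G(x, v) whose hidden groups are switched by bias offsets."""
    p = len(groups)
    onehot = np.zeros(p)
    onehot[v] = 1.0                         # extra inputs encoding the index v (0-based)
    out = 0.0
    for h, (W, b, c) in enumerate(groups):
        shift = -r * (1.0 - onehot[h])      # 0 for the selected group, -r for all others
        out += c @ logistic(W @ x + b + shift)
    return out
```

For large r the shifted groups contribute essentially nothing, so `gated_mlp(x, v, groups)` agrees with `sub_mlp` applied to group v alone. A small r leaves the other groups partially active, which corresponds to the continuous blending mentioned above.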


6  Examples 

In  this  section,  we  give  a  number  of  examples,  which  indicate  the  idea  of  the 
multiple  models  discussed  in  the  paper.  In  order  to  clarify  the  results,  we  use 
simple  model  structures. 

6.1  General  Examples 

Here  we  consider  some  general  examples  to  indicate  some  possible  types  of  mul¬ 
tiple  models  which  may  be  synthesized. 

1.  Input  Amplitude  Dependent  Model. 

This type of model is derived by using a characteristic function $M_c$ given by

$$M_c(x) = \begin{cases} \theta_0 x, & x^2 \le r_0 \\ \theta_1 x, & r_0 < x^2 \le r_1 \\ \theta_2 x, & x^2 > r_2 \end{cases} \quad (21)$$

where {r} are scalar values. The model behaviour is dependent on the instantaneous value of the input. In particular, the model will switch parameter sets when the amplitude of $x^2(t)$ crosses certain thresholds. This model can be considered as a multiple bilinear or state-dependent model.

2.  Frequency  Dependent  Model  Selection. 

$M_c$ may be a function of the frequency of the incoming signal. For example,

$$M_c(x) = f_m(X(\omega)) \quad (22)$$

where $f_m$ is a function of $X(\omega) = \mathrm{FFT}[x]$.

3.  Sequential  Model  Selection 

Here, the model is based on some predetermined time-sequence and bears a close relationship with hybrid systems considered in robotics [5]. In this case, the input x is a sequence of symbolic binary values. $M_c(x)$ processes this symbolic binary input and, upon recognition of a particular sequence, outputs a 1 and holds it for a specified period of time; otherwise the output is a zero.
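
The first two kinds of characteristic function can be sketched in a few lines. The sketch assumes two amplitude thresholds that split the squared input into three regions, and a dominant-frequency rule as one concrete choice for the spectral function; the function names, the sampling rate `fs` and the `band_edges` argument are hypothetical and chosen only for illustration.

```python
import numpy as np

def amplitude_dependent(x_t, thetas, thresholds):
    """Return theta_k * x(t), with k chosen by the region containing x(t)^2."""
    # searchsorted counts how many thresholds lie strictly below x_t**2
    k = int(np.searchsorted(thresholds, x_t ** 2))
    return thetas[k] * x_t

def frequency_dependent(x, thetas, fs, band_edges):
    """Return the parameter set selected by the dominant frequency of the block x."""
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    dominant = freqs[np.argmax(spectrum[1:]) + 1]   # skip the DC bin
    return thetas[int(np.searchsorted(band_edges, dominant))]
```

In both cases the returned value plays the role of the parameter handed to the operator F, so the overall model changes behaviour whenever the input moves into a different amplitude region or frequency band.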

6.2 A Recurrent Network Multiple Model

6.2.1  Network  Architecture 

Based on the multiple model framework presented in this paper, a recurrent network multiple model³ can be proposed as follows.

^n~m  [q  -  (upd))  “  9pci  («/?«))  ^ 

^  nr=l  ~  9ari  ('Wari))  0^  =  1  Sari  (War*))  ~  Sotri  (^ari)) 

(23) 

where {g} are the element-wise characteristic functions corresponding to $M_c$, and {u} are ancillary inputs. Since the poles and zeros in (23) are the outputs of nonlinear functions, they are termed nonlinear poles and zeros respectively, and the model can also be considered as a nonlinear pole-zero model. The coefficient functions {g} and ancillary inputs {u} provide a wide scope for introducing a variety of models.


6.2.2  Example:  An  Input  Amplitude-Dependent  Multiple  Model 

A  nonlinear  pole-zero  multiple  model  is  synthesized  in  this  example  as  follows. 
The  model  is  described  by 

"  (l-Gi  («)?-')  (i-g;(»)9-') 

Gi{u)  =  aiu(t) a2u(t)  Gl(u)  =  (t) -\- a^u  (t) 

3 For convenience and clarity of the example, we have chosen to use linear models as a basis; however, these may be arbitrary nonlinear models in practice. Since the behaviour of linear systems is well known, it is easier to visualize the differences between the two models in this case.

Figure 1: The performance of a recurrent neural network multiple model. In this simple example, (a) is the input signal, and (b) shows the model output due to the input.


where $G_1(u)$ and $G_1^*(u)$ provide complex conjugate outputs and $H_m(q)$ switches between two underlying linear models $H_1(q)$, $H_2(q)$, depending on the binary ancillary input signal $u(t)$.

$$H_1(q) = \frac{1}{1 + a_{11} q^{-1} + a_{12} q^{-2}} \quad (25)$$

$$H_2(q) = \frac{1}{1 + a_{21} q^{-1} + a_{22} q^{-2}} \quad (26)$$


where, for the purposes of this example, we choose $a_{11} = 1.6$, $a_{12} = 0.73$, $a_{21} = -1.9$, $a_{22} = 0.925$ and $\alpha_i$ are the corresponding first order poles of $H_i(q)$. The input $u(t)$ can be obtained, for example, by

$$u(t) = \Gamma\left(E_s[x^2(t)]\right) \quad (27)$$

where $E_s[\cdot]$ denotes the short term expectation and $\Gamma(\cdot)$ is a binary threshold function. Therefore, when the magnitude of the short term average of the squared input signal reaches a certain threshold, defined by $\Gamma$, the model will change. The performance of this model is shown in Fig. 1, where the model characteristics due to the change in input can be seen.
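
A minimal simulation of this switching behaviour is sketched below. It uses the denominator coefficients quoted in the text, but the window length of the short term average, the threshold value, and the assignment of the first model to u = 0 and the second to u = 1 are assumptions made only for illustration.

```python
import numpy as np

A1 = (1.6, 0.73)      # denominator coefficients of H_1(q), from the text
A2 = (-1.9, 0.925)    # denominator coefficients of H_2(q), from the text

def switching_model(x, window=4, threshold=0.5):
    """Simulate y(t) with the pole set selected by u(t) = Gamma(E_s[x^2(t)])."""
    x = np.asarray(x, dtype=float)
    # causal moving average of x^2 (zero-padded start) as the short term expectation
    power = np.convolve(x ** 2, np.ones(window) / window)[:len(x)]
    u = (power >= threshold).astype(int)    # binary ancillary input
    y = np.zeros(len(x))
    for t in range(len(x)):
        a1, a2 = A2 if u[t] else A1         # assumed mapping of u to the two models
        y1 = y[t - 1] if t >= 1 else 0.0
        y2 = y[t - 2] if t >= 2 else 0.0
        y[t] = x[t] - a1 * y1 - a2 * y2     # all-pole difference equation
    return y, u
```

Both coefficient pairs give complex pole pairs inside the unit circle, so each underlying model is stable on its own; the interest lies in how the output changes character when the input power crosses the threshold.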


6.2.3  Example:  An  RNN  Modelling  a  Time-varying  Linear  System 

In this example we synthesize a recurrent neural network model which models a time-varying linear system, described by

$$H(q) = \frac{1}{1 - c_1(t) q^{-1} - c_2(t) q^{-2}} \quad (29)$$

The model varies as a function of the ancillary input signal u, and the coefficient functions $c_i(t)$, $i = 1, 2$, are, in this example, simple linear functions, given by


$$c_i(t) = c_{i0} + c_{i1} \zeta_i(t) \quad (30)$$

$$\zeta_i(t) = \frac{b_{i0} u(t) + b_{i1} u(t-1)}{1 + a_{i1} q^{-1} + a_{i2} q^{-2}} \quad (31)$$

Figure 2: The time-varying recurrent network described in Section 6.2.3 is capable of exhibiting a variety of different behaviours as observed here.


For the purposes of this example, we choose the parameters $a_{11} = 1.6$, $a_{12} = 0.9$, $a_{21} = -1.9$, $a_{22} = 0.925$, $b_{10} = 1.0$, $b_{11} = 0.5$, $b_{20} = 0.9$, $b_{21} = 0.5$, $c_{10} = 1.75$, $c_{11} = 0.04$, $c_{20} = 0.8$, $c_{21} = 0.02$ and $u(t) = x(t)$.

The resulting multiple model, which is an extension of the usual bilinear structure, can be described by the difference equation form of the nonlinear pole-zero model, given by

$$y(t) = x(t) + c_1(t) y(t-1) + c_2(t) y(t-2) \quad (32)$$

$$c_i(t) = c_{i0} + c_{i1}\left(b_{i0} u(t) + b_{i1} u(t-1) - a_{i1} \zeta_i(t-1) - a_{i2} \zeta_i(t-2)\right) \quad (33)$$

where x(t) is the input and y(t) is the output. Examples of the model behaviour are shown in Fig. 2, which indicate some of the 'richness' of the model's capabilities.

7  Conclusions 

In  this  paper,  we  have  given  theories  which  indicate  how  nonlinear  models,  in¬ 
cluding  feedforward  and  recurrent  networks,  can  approximate  systems  known  as 
multiple  models.  We  have  shown  that  such  models  can  be  considered  in  terms  of 
operational  maps.  It  was  shown  that  there  exist  classes  of  neural  networks  which 
can  universally  approximate  operational  maps.  The  results  provide  an  explana¬ 
tion  for  the  experimental  behaviour  of  some  recurrent  networks  in  being  able  to 
model multiple dynamic systems [8].


Abstract

Additional  inputs  to  a  feed-forward  network,  derived 
from  the  output  of  the  hidden  layer  neurons,  allow  a 
feed-forward  network  to  deal  with  temporal  pattern  recog¬ 
nition  and  reproduction  tasks.  These  ’network  derived’ 
or  ’context’  inputs  augment  the  ’true’  inputs  to  the  net¬ 
work  and  allow  the  network  to  retain  past  information 
necessary  for  temporal  sequence  processing.  The  choice 
of  which  hidden  neurons  to  retain  to  provide  the  con¬ 
text  inputs  is  difficult.  Use  of  all  the  hidden  neurons  in¬ 
creases  the  size  of  the  overall  network  resulting  in  poorer 
generalization  performance.  The  problem  is  complicated 
due  to  difficulty  in  choosing  the  number  of  hidden  layer 
neurons in the first place. In this paper, we propose the use of regularization terms in the sum-of-squared error cost function. Assuming the hidden layer neurons are indexed 1, 2, ..., m, the regularization terms force the differentiation of hidden neurons 1 through $m_1$, and $m_2$ through m (where $1 < m_1 < m_2 < m$). Both $m_1$ and $m_2$ are controllable and allow fringe neurons to be used to provide the context inputs if the number of context units to use is known. When the number of context neurons to use cannot be determined, the regularization terms minimize $m_1$ and maximize $m_2$, while hidden neurons $m_1$ through $m_2$ are penalized for differentiation. An amplitude detection simulation is used to evaluate the efficacy of the proposed paradigm.


0-7803-4256-9/97/$10.00 ©1997 IEEE


I Introduction

Neural  network  models  for  temporal  sequence  processing  can  be  broadly  clas¬ 
sified  into  [1],[2]: 

1.  Tapped  delay  line  models:  The  network  has  past  inputs  explicitly 
available  (through  a  tapped  delay  line)  to  determine  its  response 
at  a  given  point  in  time  (see  for  example  [3]).  Thus  the  temporal 
pattern  is  converted  to  a  spatial  pattern  which  can  then  be  learned 
through,  say,  classic  back-propagation  [5]. 

2. Context models or Partial recurrent models: These models retain the past output of neurons instead of retaining the past raw inputs. For example, the output of the hidden layer neurons of a feed-forward network can be used as inputs to the network along with the true inputs [6]. These 'network derived' inputs are also called context inputs. When the interconnections carrying the context inputs are fixed, classic back-propagation can be used for training the network. More complex variations of this basic idea include self-feedback in the context inputs or deriving the context inputs from other locations in the network [3], [7].

3.  Fully  recurrent  models:  These  models  employ  full  feedback  and  in¬ 
terconnections  between  all  units  [8]-[10]. 

While tapped delay line models are the simplest to use and have been applied to such tasks as speech recognition (see for example [11]), one has to use a tapped delay line of length equal to the longest possible sequence to accommodate all the sequences. This increases the input dimensionality, and consequently the size of the network, requiring much more training data [12]. Alternatively, a larger network gives poorer generalization performance as compared to a smaller sized network when both networks are trained on the same amount of data. Fully recurrent models, which lie on the other end of the spectrum, are the most flexible but suffer from large time and storage requirements [1]. Context models lie somewhere between the simplicity of a tapped delay line model and the power of a fully recurrent network. For many sequence processing tasks they provide competitive solutions. It has also been shown that such context models can approximate the behavior of a finite state automaton [13].
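
To make the tapped delay line idea concrete, the sketch below converts a scalar sequence into the spatial patterns such a model would see. The helper name `tapped_delay_inputs` is hypothetical.

```python
import numpy as np

def tapped_delay_inputs(x, taps):
    """Rows are [x(t), x(t-1), ..., x(t-taps+1)] for t = taps-1, ..., len(x)-1."""
    x = np.asarray(x, dtype=float)
    return np.stack([x[t - taps + 1:t + 1][::-1] for t in range(taps - 1, len(x))])
```

The input dimensionality equals `taps`, which is why accommodating long sequences inflates the network, whereas a context model keeps the true input dimension fixed and carries history in the fed-back hidden outputs.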

In the actual implementation of context models, one is faced with the difficulty of selecting the hidden layer neurons which will provide the context inputs. One can obviously use the entire hidden layer, but this results in an increase of the dimensionality of the augmented input space, thereby requiring additional training data. The problem is further complicated since the number of hidden layer neurons is not known in the first place. This paper is directed towards obtaining better generalization performance despite these difficulties.

Our approach is based on trying to localize the hidden neurons which differentiate while forcing the non-differentiating hidden layer neurons to have minimal activation. The localization of the hidden layer neurons is forced to appear at the fringes, i.e. assuming the hidden layer neurons are indexed 1, 2, ..., m, the regularization terms force the differentiation of hidden neurons 1 through $m_1$, and $m_2$ through m (where $1 < m_1 < m_2 < m$). Both $m_1$ and $m_2$ are controllable and allow fringe neurons to be used to provide the context inputs if the number of context units to use is known. When the number of context neurons to use cannot be determined, the regularization terms minimize $m_1$ and maximize $m_2$, while hidden neurons $m_1$ through $m_2$ are penalized for differentiation. We attempt to ensure this behavior through the use of regularization terms in the familiar sum-of-squared error cost function.

The  rest  of  this  paper  is  organized  as  follows.  In  the  next  section,  we  derive 
an  algorithm  that  induces  specialization  of  the  fringe  hidden  layer  neurons. 
In  section  3,  we  illustrate  the  efficacy  of  the  proposed  model  with  simulations. 
In  section  4,  we  present  our  conclusions. 


II  Induced  Specialization  of  Context  Units 

We consider a standard feed-forward architecture with n inputs, a single hidden layer with m neurons, and an output layer with o neurons. The network operates synchronously and in a layered manner, i.e. all neurons in a layer are simultaneously updated, and then the next layer is updated and so on. The n inputs to the network consist of the 'true' inputs (x), and the context inputs which are the immediate past output of the fringe hidden layer neurons copied using one-to-one non-modifiable connections (see Figure 1). Since the contextual feedback is always derived from the fringe hidden layer neurons, our approach is to train the network using a cost function which includes, besides the sum-of-squared error term, regularization terms which force the fringe hidden layer neurons to differentiate.

We denote the network training data as consisting of p input-desired output pairs $\{(\xi^\mu, \zeta^\mu)\}$. $\xi^\mu$ is thus the augmented input vector comprising the true inputs and the context inputs. The weight from input k to hidden layer neuron j is denoted by $w_{jk}$ and the weight from hidden neuron j to output neuron i is denoted by $W_{ij}$.

The output of a hidden layer neuron ($z_j^\mu$) on input pattern $\mu$ is a non-linear function of its net input ($h_j^\mu$), i.e.,

$$z_j^\mu = f(h_j^\mu) = f\left(\sum_{k=1}^{n} w_{jk} \xi_k^\mu\right) \quad (1)$$

Similarly, the output of an output layer neuron ($y_i^\mu$) on input pattern $\mu$ is a non-linear (or linear) function of its net input ($s_i^\mu$), i.e.,

$$y_i^\mu = f(s_i^\mu) = f\left(\sum_{j=1}^{m} W_{ij} z_j^\mu\right) \quad (2)$$
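
The forward pass of such a context network can be sketched as follows. The sketch assumes logistic hidden units, a linear output layer, no bias terms, and zero initial context; the function name `run_context_network` and the `fringe` index list are hypothetical.

```python
import numpy as np

def logistic(h):
    return 1.0 / (1.0 + np.exp(-h))

def run_context_network(x_seq, w, W, fringe):
    """Run a sequence through the network; `fringe` lists the hidden indices fed back."""
    m = w.shape[0]
    z_prev = np.zeros(m)                      # hidden outputs from the previous time step
    outputs = []
    for x in x_seq:
        # augmented input: true inputs followed by copied fringe activations
        xi = np.concatenate([np.atleast_1d(x), z_prev[fringe]])
        z_prev = logistic(w @ xi)             # hidden layer outputs
        outputs.append(W @ z_prev)            # linear output layer (an assumption)
    return np.array(outputs)
```

With 15 hidden neurons and `fringe = [0, 1, 13, 14]`, the weight matrix `w` has one column per true input plus four context columns, so the output at any time depends on earlier samples only through those four copied activations.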

Figure 1: The architecture of the proposed network. The context inputs are derived from the fringe hidden layer neurons which are made to specialize through a regularization term added to the traditional sum-of-squared error cost function. Weights with hollow arrow-heads are fixed (at 1) and serve only to produce a copy of the activation from source to destination.

To induce specialization of the context units we introduce regularization terms in the sum-of-squared error formulation of the cost function. Thus we define a cost function as:


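$$J = \frac{1}{2}\sum_{\mu=1}^{p}\left[\sum_{i=1}^{o}\left(\zeta_i^\mu - y_i^\mu\right)^2 + \lambda_1\sum_{j=1}^{m} e^{-(j-m/2)^2/2\sigma^2}\left(z_j^\mu\right)^2 + \lambda_2(\sigma-\theta)^2\right] \quad (3)$$

where $\lambda_1$ and $\lambda_2$ are Lagrange multipliers, and $\theta$ is explained shortly. Regularization is introduced in (3) through the use of two terms which we explain below.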

The first of these terms consists of two components. The first component, the exponential factor, is a Gaussian which achieves its maximum at m/2, i.e. it achieves a maximum for hidden layer neurons in the middle and falls off at a rate governed by $\sigma$ as we approach the fringe hidden layer neurons. The second component of this first term, $(z_j^\mu)^2$, is the square of the activation of hidden layer neuron j. The product of the two components ensures that neuron activations in the middle of the hidden layer (i.e. neurons with indices near m/2) are penalized. Consequently the fringe hidden layer neurons are made to specialize through this term, allowing contextual feedback to be drawn from them.


The second regularization term, $(\sigma - \theta)^2$, tries to make $\sigma$ approach $\theta$. $\theta$ should be chosen such that the Gaussian (in the first regularization term) has decayed to a small value as one approaches the fringe hidden layer neurons selected for providing the contextual inputs. Consequently, the value of $\theta$ chosen reflects how many contextual inputs are required. Once the number of contextual inputs has been decided, one can simply choose them to be the fringe neurons. Alternatively, one can use all the hidden layer neurons to provide the context inputs while using $\theta = m/2$. This choice allows $\sigma$ to be adapted to be the largest possible, or conversely, as few hidden layer neurons as possible to differentiate. The remaining hidden layer neurons which are fed back as context inputs do not differentiate (i.e. behave similarly) and consequently do not impede generalization performance.

Thus, the cost function in (3) forces the minimization of the SSE under the constraint that: (i) the chosen fringe hidden layer neurons differentiate the most, or (ii) when all the hidden layer neurons provide the context inputs, as many of them as possible behave similarly.
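
A regularized cost of this kind can be evaluated as in the sketch below. The exact Gaussian normalization, the placement of the width penalty inside the sum over patterns, and the name `cost` are assumptions made for illustration; only the overall structure (sum-of-squared error, a Gaussian-weighted penalty on squared hidden activations centred at m/2, and a penalty tying the Gaussian width to a target value) is taken from the description above.

```python
import numpy as np

def cost(targets, outputs, hidden, sigma, theta, lam1, lam2):
    """SSE plus the two regularization terms.

    targets, outputs: arrays of shape (p, o); hidden: activations of shape (p, m).
    """
    p, m = hidden.shape
    j = np.arange(1, m + 1)                                     # hidden indices 1..m
    gauss = np.exp(-(j - m / 2.0) ** 2 / (2.0 * sigma ** 2))    # assumed Gaussian form
    sse = np.sum((targets - outputs) ** 2)
    middle_penalty = np.sum(gauss * hidden ** 2)                # largest for middle neurons
    width_penalty = p * (sigma - theta) ** 2                    # pulls sigma towards theta
    return 0.5 * (sse + lam1 * middle_penalty + lam2 * width_penalty)
```

With 15 hidden neurons and a width of 2, a unit activation at the middle neuron is weighted by roughly 0.97 while the same activation at the first neuron is weighted by roughly 0.005, which is what drives differentiation towards the fringes.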

We now proceed to derive the update equations. Since the context inputs are copied from the output of the hidden layer neurons through non-modifiable weights, the update equations are easily obtained from (3) by performing gradient descent in the weight space, and adjusting the weights proportional to the negative of the gradient. We thus obtain for the output to hidden weights:


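$$\Delta W_{ij} = -\eta \frac{\partial J}{\partial W_{ij}} = \eta \sum_{\mu=1}^{p} \left[(\zeta_i^\mu - y_i^\mu) f'(s_i^\mu)\right] z_j^\mu = \eta \sum_{\mu=1}^{p} \delta_i^\mu z_j^\mu \quad (4)$$

where $\delta_i^\mu = (\zeta_i^\mu - y_i^\mu) f'(s_i^\mu)$ and $\eta$ is a constant of proportionality referred to as the learning rate.

Similarly, the weight update equations for the hidden to input weights are:

$$\Delta w_{jk} = -\eta \frac{\partial J}{\partial w_{jk}} = \eta \sum_{\mu=1}^{p} \left[\left(\sum_{i=1}^{o} \delta_i^\mu W_{ij}\right) - \lambda_1 e^{-(j-m/2)^2/2\sigma^2} z_j^\mu\right] f'(h_j^\mu)\, \xi_k^\mu \quad (5)$$

Finally, the update equation for $\sigma$ is:

$$\Delta \sigma = -\eta \frac{\partial J}{\partial \sigma} = -\eta \sum_{\mu=1}^{p} \left[\lambda_1 \sum_{j=1}^{m} \frac{(j-m/2)^2}{\sigma^3} e^{-(j-m/2)^2/2\sigma^2} \left(z_j^\mu\right)^2 + \lambda_2 (\sigma - \theta)\right] \quad (6)$$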

where constants have been absorbed into the Lagrange multipliers. The training begins by choosing the number of context inputs. If the choice can be made, then the context inputs should be derived from the fringes. If the choice cannot be made, then one can use $\theta = m/2$ and use all the hidden neurons to provide the context inputs. Patterns are then continually presented in sequence, and parameters are updated (in batch or pattern mode) according to equations (4)-(6) until the error is within the desired tolerance. To prevent large numbers it is best to scale $\sigma$ and $\theta$ such that they are less than or equal to 1. In the next section we present a simulation to illustrate the efficacy of the proposed approach.


III Simulation

The simulation we present deals with amplitude detection in a signal formed by concatenating sinusoids of fixed frequency but varying amplitude¹. The entire data set is shown in Figure 2. Sine waves of four different amplitudes (1.0, 2.0, 1.6, and 1.2) are used to generate the data, with 20 points sampled from each amplitude. We use the first 40 points (corresponding to amplitudes of 1 and 2) for training and reserve the remaining 40 points (corresponding to amplitudes of 1.6 and 1.2) for generalization. Observe that it is not possible to estimate the amplitude of the underlying sine wave at a given point in time without examining more than one successive sample.
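
A data set of this form can be generated as sketched below. The sampling of the sinusoid (one radian per sample, starting at 1) is an assumption, since the text specifies only the amplitudes and the number of points per segment; the function name `amplitude_dataset` is hypothetical.

```python
import numpy as np

def amplitude_dataset(amplitudes=(1.0, 2.0, 1.6, 1.2), points=20):
    """Concatenate sinusoid segments; the desired output is the segment amplitude."""
    phase = np.arange(1, points + 1)          # assumed sampling: sin(1), ..., sin(20)
    x = np.concatenate([a * np.sin(phase) for a in amplitudes])
    d = np.concatenate([np.full(points, a) for a in amplitudes])
    return x, d

x, d = amplitude_dataset()
x_train, d_train = x[:40], d[:40]    # amplitudes 1.0 and 2.0
x_test, d_test = x[40:], d[40:]      # amplitudes 1.6 and 1.2, unseen during training
```

Because a single sample value is compatible with many amplitudes, any network solving this task must use information from earlier samples, which is exactly what the context inputs provide.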

We considered a network of 1 (true) input, 15 sigmoidal hidden neurons, and 1 output. We considered two distinct situations. In the first situation, we used a total of 4 context inputs (2 drawn from the left and 2 drawn from the right fringe). Consequently, $\theta$ was selected so that the Gaussian in equation (3) would decay to a near zero value at the 2nd neuron from the left and right. This simulation thus represents the condition where the number of context neurons to use has been determined. In the second situation we used all the hidden neurons to provide the context inputs, simulating the condition where the number of context neurons to use is unknown. We refer to a context network trained with back-propagation as a Context Network (CN) and a context network trained with the proposed Induced Specialization as (CNIS). For all cases, we trained the network using the first 40 points to the same sum-of-squared error (0.5). Figure 3 shows the results when 4 hidden layer neurons (2 each from the left and right fringe) are used for providing the context inputs, while Figure 4 shows the results when all the hidden layer neurons are used to provide the context inputs. It is clear that in both cases CNIS performs better. Table I quantifies the differences.

1 This problem is given as a demonstration of a standard Elman network (called CN in this paper) in the neural network toolbox of Matlab. Though the results in this paper are generated from our own code, we were inspired to use it since the network has to decide on an amplitude that it was never trained with. It thus serves as a good test of generalization.

Figure 2: The complete data used for the simulations. (a) shows the input while (b) shows the desired output. The desired output of the network at a point in time is the amplitude of the sine wave at that time. The first 40 points are used for training. The remaining 40 points are used for obtaining the generalization performance.

Figure 3: Performance comparison of the standard context network and the proposed context network with induced specialization. A network of 1 true input, 15 sigmoidal hidden layer neurons, and 1 linear output is used. For the standard context network (CN) and the context network with induced specialization (CNIS), 4 hidden layer neurons (2 each from the left and right fringe) are used to provide the context input. Both networks are trained to the same error (0.5) on the first 40 points.

Figure 4: Performance comparison of the standard context network and the proposed context network with induced specialization. A network of 1 true input, 15 sigmoidal hidden layer neurons, and 1 linear output is used. For the standard context network (CN) and the context network with induced specialization (CNIS), all 15 hidden layer neurons provide context input. Both networks are trained to the same error (0.5) on the first 40 points.


#  Context  Inputs 

CN

CNIS

- i 

15 

1.6466 

0.7227 

1.1512 

Table I: Sum-of-squared error on generalization. The network used in all cases had
1  ’true’  input,  15  sigmoidal  hidden  layer  neurons,  and  1  linear  output  neuron.  All 
networks  were  trained  to  the  same  error  (0.5)  on  the  first  40  points. 
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IV  Conclusion 


We proposed the use of regularization terms in the standard sum-of-squared
error function to allow fringe hidden layer neurons to differentiate, while
penalizing the differentiation of the neurons in the middle of the hidden
layer. This induced specialization allowed contextual feedback to be always
drawn from the fringe hidden layer neurons, yielding a network with better
generalization properties. Initial results indicate improved generalization
performance with the proposed paradigm. While approaches such as pruning
(see for example [14]) can be used to obtain better generalization, there are
two points in favor of the proposed paradigm:

•  Pruning  approaches  typically  estimate  the  sensitivity  of  the  error  to 
a  weight  and  discard  the  weight  if  the  output  error  does  not  depend 
on  the  particular  weight.  Consequently,  composite  effects  of  weight 
removals  are  not  considered.  Since  the  entire  network  is  trained  in 
the  proposed  paradigm,  composite  effects  are  included. 

• One often requires retraining after pruning a weight (though methods
to distribute the role of the weight to be removed have been
proposed).

Of  course,  there  appears  to  be  no  reason  why  pruning  approaches  could  not 
be  applied  after  training  with  the  proposed  paradigm. 
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Abstract 

This  work  studies  some  of  the  approximating  properties  of  feedforward 
neural  networks  as  a  function  of  the  number  of  nodes.  Two  cases  are 
considered:  sigmoidal  and  radial  basis  function  networks.  Bounds  for  the 
approximation  error  are  given.  The  methods  through  which  we  arrive  at 
the bounds are constructive. The error studied is the $L_\infty$ or sup error.

1  STATEMENT  OF  THE  PROBLEM 

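Let $x \in \mathbb{R}^n$. Feed-forward sigmoidal neural networks compute

$$\sum_{i=1}^{N} c_i F(g_i(x)), \qquad (1)$$

where $F : \mathbb{R} \to \mathbb{R}$ is sigmoidal and

$$g_i(x) = \sum_{j=1}^{n} g_{ij} x_j - a_i.$$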

The  approximation  properties  of  these  neural  networks  as  a  function  of  N  are 
of  theoretical  and  practical  interest. 

Another  interesting  type  of  network  is  the  so-called  radial  basis  function 
neural  network,  in  which  case  (1)  is  replaced  with 

$$\sum_{i=1}^{N} c_i G(h_i(x)),$$

where

$$h_i(x) = \|x - x_i\|_2 \qquad (x_i \in \mathbb{R}^n).$$

In  this  case,  the  set  of  possible  choices  for  G  includes  the  Gaussian  function 


2^ 


2 
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but  also 


\/N^+c, 


y/W^ 


X  ,  X 


The  purpose  of  this  paper  is  to  study  the  following  nonlinear  approximation 
problems. 


Problem 1 Consider the function

$$a(x) = \sum_{i=1}^{N} c_i F(g_i(x)),$$

where $F$ is sigmoidal and $g_i : \mathbb{R}^n \to \mathbb{R}$ is

$$g_i(x) = \sum_{j=1}^{n} g_{ij} x_j - a_i.$$

The function clearly depends on a number of parameters (the coefficients $c_i$, $g_{ij}$
and $a_i$). Given the function $f : D \subset \mathbb{R}^n \to \mathbb{R}$, the problem is to select $N$ and
then the coefficients in such a way that the $L_\infty$ error

$$e = \sup_{x \in D} |f(x) - a(x)|$$

is small. We wish to study the dependence of $e$ upon $N$.

Problem 2 Consider the function

$$a(x) = \sum_{i=1}^{N} c_i G(h_i(x)),$$

where $G$ is Gaussian and $h_i : \mathbb{R}^n \to \mathbb{R}$ is

$$h_i(x) = \|x - x_i\|_2.$$

The function clearly depends on a number of parameters (the coefficients $c_i$, the
parameters of the Gaussian $G$, the vectors $x_i$). Given the function $f : D \subset
\mathbb{R}^n \to \mathbb{R}$, the problem is to select $N$ and then the parameters in such a way that
the $L_\infty$ error

$$e = \sup_{x \in D} |f(x) - a(x)|$$

is small. We wish to study the dependence of $e$ upon $N$.
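As a concrete reading of the two network forms in Problems 1 and 2, the following Python sketch evaluates a sigmoidal superposition and a Gaussian radial basis superposition, and estimates the sup error on a finite grid. The logistic choice for F, the unnormalized Gaussian exp(-r^2 / (2 sigma^2)) for G, and all function names are illustrative assumptions and are not taken from the paper. A maximum over a grid only approximates the true supremum over D from below.

```python
import math


def logistic(z):
    """One common sigmoidal function F (an assumed choice)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoidal_net(x, c, g, a):
    """a(x) = sum_i c_i * F(sum_j g_ij * x_j - a_i)."""
    return sum(
        ci * logistic(sum(gij * xj for gij, xj in zip(gi, x)) - ai)
        for ci, gi, ai in zip(c, g, a)
    )


def rbf_net(x, c, centers, sigma):
    """a(x) = sum_i c_i * G(||x - x_i||_2), with G(r) = exp(-r^2 / (2 sigma^2)) (assumed)."""
    total = 0.0
    for ci, xi in zip(c, centers):
        r2 = sum((xj - xij) ** 2 for xj, xij in zip(x, xi))
        total += ci * math.exp(-r2 / (2.0 * sigma ** 2))
    return total


def sup_error(f, net, points):
    """Grid estimate of sup over D of |f(x) - a(x)|."""
    return max(abs(f(p) - net(p)) for p in points)


if __name__ == "__main__":
    grid = [(i / 10.0, j / 10.0) for i in range(11) for j in range(11)]
    target = lambda p: logistic(p[0] + p[1] - 1.0)
    net = lambda p: sigmoidal_net(p, [1.0], [[1.0, 1.0]], [1.0])
    print(sup_error(target, net, grid))  # a one-node net that matches the target
```

In both cases N is simply the length of the coefficient list c, so the dependence of the error on N can be explored by fitting lists of increasing length.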

The  equivalent  one-dimensional  problems,  although  considerably  simpler,  are 
also  of  interest  (primarily  because  they  suggest  methods  that  might  work  for 
the  general  case). 

Problem 3 Given $f : [0, 1] \to \mathbb{R}$, study the $L_\infty$ error associated with the
approximation

$$f(x) \approx \sum_{i=1}^{N} c_i F(g_i x - a_i),$$

where $F$ is sigmoidal.
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Problem 4 Given $f : [0, 1] \to \mathbb{R}$, study the $L_\infty$ error associated with the
approximation

$$f(x) \approx \sum_{i=1}^{N} c_i\, G(|x - x_i|),$$

where $G$ is the Gaussian.


2  RELATED  WORK 

A  recent  overview  of  neural  networks  and  their  relevance  to  signal  processing 
applications  can  be  found  in  [7].  For  a  study  of  radial  basis  neural  networks 
see  [9] . 

The approximation properties of superpositions of sigmoidal functions were
studied by Cybenko [2]. Park and Sandberg [10] studied the approximation
properties of radial basis function neural networks. Barron [1] gave striking
bounds for the approximation error committed when approximating a given
function $f : \mathbb{R}^n \to \mathbb{R}$ by superpositions of a sigmoidal function. The bounds,
surprisingly, do not depend on the dimension $n$ of the space considered.

Barron considers the mean square error (the error in the norm of $L_2$),
whereas we consider uniform approximation and consequently the sup norm
(or the $L_\infty$ norm). Mean square and uniform convergence are distinct concepts:
a sequence of functions may converge in the norm of $L_2$ but not in the sup
norm. Two given functions $f$ and $g$ may differ by very little in the $L_2$ norm,
while differing by arbitrarily large numbers in the $L_\infty$ norm. On the other
hand, if the domain of the functions is a compact interval, convergence in $L_\infty$
does imply mean square convergence. Thus, the $L_\infty$ error, despite the challenging
analytical difficulties that it often poses, may help in understanding these
approximation problems.
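A small numerical illustration of the distinction between mean square and uniform convergence may help. The sketch below uses the textbook sequence f_n(x) = x^n on [0, 1], which is our own choice of example and does not appear in the paper: its L2 norm tends to zero, while its sup norm stays equal to 1. The helper names and the quadrature resolution are assumptions made for illustration.

```python
import math


def l2_norm(f, a=0.0, b=1.0, m=10_000):
    """Midpoint-rule estimate of the L2 norm of f on [a, b]."""
    h = (b - a) / m
    total = sum(f(a + (i + 0.5) * h) ** 2 for i in range(m))
    return math.sqrt(total * h)


def sup_norm(f, a=0.0, b=1.0, m=10_000):
    """Grid estimate of the sup norm of f on [a, b], endpoints included."""
    return max(abs(f(a + i * (b - a) / m)) for i in range(m + 1))


if __name__ == "__main__":
    for n in (1, 10, 100):
        f_n = lambda x, n=n: x ** n
        # The L2 norm shrinks with n; the sup norm does not.
        print(n, l2_norm(f_n), sup_norm(f_n))
```

The converse direction stated in the text is visible in the same setting: on [0, 1] the L2 norm can never exceed the sup norm, so a small sup error forces a small mean square error.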

The approximation problem discussed here was studied in [6] for the Gaussian,
one-dimensional case. The approach that we use, in contrast to others, is
constructive (it yields the values of the parameters, as well as the bounds). It
also relates to some recently obtained results concerning nonuniform sampling
approximations [3-5].

Recently, we came across [11], which also addresses the uniform approximation
problem¹ and presents an interesting treatment of the approximating
power of a network. The bounds given are distinct from ours. Asymptotically,
they are weaker (a brief comparison is outlined below).


3  THE  ONE-DIMENSIONAL  PROBLEMS 

The notation $f \in BV$ means that the function $f$ is of bounded variation. The
value of the variation itself is $V(f)$. The support of a function $f$ is denoted by
$\mathrm{supp}(f)$.

¹I am grateful to Prof. Bock (Aachen, Germany) for bringing this work to my attention
and kindly supplying a copy.
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3.1  Gaussian-based  radial  basis  functions 


It follows readily from the work of Norbert Wiener [12] on the closure of translations
that superpositions of Gaussians are dense in the $L_1$ and $L_2$ spaces.

Take $L_1$, for example, and let $\psi \in L_1$. Wiener showed that any $L_1$ function
can be approximated in the $L_1$ norm by

$$\sum_{i=1}^{N} c_i \psi(t - t_i),$$

iff the Fourier transform $\hat{\psi}$ of $\psi$ has no zeros. The Fourier transform of the Gaussian
certainly has no zeros, a fact that immediately implies the universal approximation
properties of sets of translated Gaussians.

However, Wiener's proof is not constructive. We will show, with the help of
a constructive procedure, how to pick $c_i$ and $t_i$ and to obtain a bound for the
$L_\infty$ approximation error.

The idea is to consider the convolution

$$f_\sigma(t) = \int_{-\infty}^{+\infty} f(\tau)\, G(t - \tau, \sigma)\, d\tau,$$

where $G(x, \sigma)$ is Gaussian with standard deviation $\sigma$, $f$ is the function to
approximate, and $f_\sigma$ denotes the result of the convolution (which depends upon
$\sigma$ through the convolution kernel). Next, we approximate $f$ by $f_\sigma$, and $f_\sigma$ by
a finite sum. Since the approximation of $f$ by $f_\sigma$ is well-known, it remains to
study the error that arises when approximating the convolution by the finite
sum. This is the purpose of the following theorem, in which the support of $f$ is
assumed to be the interval $[0, 1]$.


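Theorem 1 Let $f : \mathbb{R} \to \mathbb{R}$ be BV, with $\mathrm{supp}(f) \subset [0, 1]$. Denote by $\{t_i\}$ any
$N$ reals such that

$$\frac{i-1}{N} \le t_i \le \frac{i}{N} \qquad (2)$$

for $1 \le i \le N$, and let

$$f_\sigma(t) = \int_0^1 f(\tau)\, G(t - \tau, \sigma)\, d\tau.$$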

Then, 


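Proof: By the mean-value theorem, there exist in the intervals defined by
(2) points $\{\xi_i\}_{1 \le i \le N}$ such that

$$f_\sigma(t) = \frac{1}{N} \sum_{k=1}^{N} f(\xi_k)\, G(t - \xi_k, \sigma).$$

This shows that

$$\left| f_\sigma(t) - \frac{1}{N} \sum_{k=1}^{N} f(t_k)\, G(t - t_k, \sigma) \right| \le \frac{1}{N} V[f(x) G(t - x, \sigma)].$$

The result follows after evaluating the variation of the product. ■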

As $\sigma$ decreases to zero, $f_\sigma \to f$ in the pointwise sense, provided that $f$ has some regularity.
For any fixed $\sigma$, it is always possible to select $N$ such that (3) becomes less than
any specified positive number. Hence, it is always possible to obtain arbitrarily
good approximations to $f$:

$$\|f - s_N\| \le \|f - f_\sigma\| + \|f_\sigma - s_N\|$$

in any norm.
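The constructive step of this subsection can be tried numerically. The sketch below forms the finite superposition (1/N) * sum_k f(t_k) G(t - t_k, sigma) with t_k taken as the midpoint of [(k-1)/N, k/N], and compares it with a fine quadrature of the convolution integral over [0, 1]. The normalized Gaussian density, the hat-shaped target f, the midpoint choice of t_k, and every function name are assumptions made for illustration only; the paper allows any t_k in the stated intervals.

```python
import math


def gauss(x, sigma):
    """Normalized Gaussian density with standard deviation sigma (assumed form of G)."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def f(x):
    """Example target of bounded variation supported on [0, 1] (a hat function)."""
    return max(0.0, 1.0 - abs(2.0 * x - 1.0))


def s_N(t, N, sigma):
    """Finite Gaussian superposition with nodes t_k = (k - 1/2) / N."""
    return sum(f((k - 0.5) / N) * gauss(t - (k - 0.5) / N, sigma)
               for k in range(1, N + 1)) / N


def f_sigma(t, sigma, m=4000):
    """Reference value of the convolution integral over [0, 1] (fine midpoint rule)."""
    return sum(f((j + 0.5) / m) * gauss(t - (j + 0.5) / m, sigma)
               for j in range(m)) / m


def sup_error(N, sigma, grid=50):
    """Grid estimate of sup over t in [0, 1] of |f_sigma(t) - s_N(t)|."""
    return max(abs(f_sigma(i / grid, sigma) - s_N(i / grid, N, sigma))
               for i in range(grid + 1))


if __name__ == "__main__":
    for N in (10, 20, 40):
        print(N, sup_error(N, sigma=0.1))
```

Here the coefficients are c_k = f(t_k) / N and the centers are the t_k themselves, which is what makes the procedure constructive: no optimization over the parameters is needed once sigma and N are chosen.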

We  now  turn  to  approximation  by  superposition  of  sigmoidal  functions. 


3.2  SIGMOIDAL  FUNCTIONS 

Definition 1 A bounded function $F$ is sigmoidal if

$$\lim_{x \to +\infty} F(x) = 1, \qquad \lim_{x \to -\infty} F(x) = 0.$$

In the following, the dilation $\psi(wx)$ of any function $\psi$ is denoted by $\psi_w(x)$,
and the function $u$ is the unit step function. It is assumed without loss of
generality that the functions $f$ to approximate satisfy $f(0) = 0$.

Lemma 1 If $F$ is sigmoidal, $f \in BV$, and $\mathrm{supp}(f) \subset [0, 1]$,

$$\left| f(t) - \int_0^1 f'(x)\, F_w(t - x)\, dx \right| \le \epsilon V(f),$$

uniformly in $t$, for all $w$ such that $|u(x) - F_w(x)| < \epsilon$.

Proof: Assume that $f(0) = 0$. The fact that $f \in BV$ justifies the identity


f(t)  =  [  f{t-x)u{x)dx 

Jo 


-J 


f{x)u{t  —  x)dx, 


and  so 


f(t)~  f  f(x)Fy,(t~x)dx=  [  f(t  ~  x)[u{x)  -  Fy,{x)]dx. 
Jo  Jo 

If $x \neq 0$, given any $\epsilon > 0$ there is a $\mu > 0$ such that

$$|u(x) - F_w(x)| < \epsilon$$

if $w > \mu$. Thus,

/W  -  f  fix)^w{t  -x)dx  <€  [  \f(t  -x)\dx  <  eV{f). 
Jo  Jo 
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Lemma  2  Take  any  N  reals  {^A:}i<Aj<iV  satisfying 


k-1 

N 


<tk< 


AT’ 


k  =  l,2,.,.,N, 


and  consider  the  functions 


/u>(^) 


f{x)Fyj{t  -  x)dx 
f{t  -  x)F^(x)  dx, 


Then 


1  ^ 

*=1 

\fw{t)  -  SwWI  <  - - ’ 


assuming  f  £  BV ,  4>  G  BV . 


Proof: Similar to the proof of Theorem 1: use the mean-value theorem, then
bound the variation $V[f'(x) F_w(t - x)]$. ■


Theorem  2  For  any  e  >  0  there  is  w  such  that 

\m-sN(t)\  =  o(^^y 

for  any  {tjb}i<fc<JV  satisfying 

Proof: We have

$$\|f - s_N\| \le \|f - f_w\| + \|f_w - s_N\|.$$

The first term converges to zero as $w \to \infty$, and the second term is bounded
by a term inversely proportional to $N$ (Lemma 2). The first term can be made
smaller than $\epsilon/2$ by picking $w$ sufficiently large. Once $w$ is fixed, take $N$ so large
that $\|f_w - s_N\| < \epsilon/2$. ■
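The two-step argument (first fix the dilation w, then increase N) can be observed numerically. The sketch below assumes that the finite sum is the Riemann sum (1/N) * sum_k f'(t_k) F_w(t - t_k) of the integral of f'(x) F_w(t - x) over [0, 1]; this reading, the logistic choice of F, the test function f(x) = sin^2(pi x), and the midpoint choice of t_k are all assumptions made for illustration, since the corresponding formulas are not fully legible in the source.

```python
import math


def F(x, w):
    """Dilated logistic sigmoid F_w(x) = F(w x) (an assumed choice of F)."""
    z = w * x
    if z < -700.0:  # guard against overflow in exp for very negative arguments
        return 0.0
    return 1.0 / (1.0 + math.exp(-z))


def f(x):
    """Smooth test function with f(0) = 0 and support in [0, 1]."""
    return math.sin(math.pi * x) ** 2 if 0.0 <= x <= 1.0 else 0.0


def f_prime(x):
    """Derivative of f on [0, 1], zero elsewhere."""
    return math.pi * math.sin(2.0 * math.pi * x) if 0.0 <= x <= 1.0 else 0.0


def s_N(t, N, w):
    """(1/N) * sum_k f'(t_k) F_w(t - t_k), with t_k the midpoint of [(k-1)/N, k/N]."""
    return sum(f_prime((k - 0.5) / N) * F(t - (k - 0.5) / N, w)
               for k in range(1, N + 1)) / N


def sup_error(N, w, grid=200):
    """Grid estimate of sup over t in [0, 1] of |f(t) - s_N(t)|."""
    return max(abs(f(i / grid) - s_N(i / grid, N, w)) for i in range(grid + 1))


if __name__ == "__main__":
    for N in (25, 100, 400):
        print(N, sup_error(N, w=200.0))
```

With w held fixed, the error stops improving once the Riemann-sum error falls below the smoothing error caused by replacing the unit step with F_w, which mirrors the split into two terms used in the proof.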

It is not necessary to have $f'$ of bounded variation. The restriction may be
removed by approximating it by an absolutely continuous function $h'$ such that

$$|f'(t) - h'(t)| < \epsilon, \qquad \int_0^1 |f'(t) - h'(t)|\, dt < \eta.$$
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4  THE  MULTIDIMENSIONAL  PROBLEM 

In this section we sketch the solution to the multidimensional problem for radial
basis neural networks. The statement and proofs of the results depend on
certain number theoretic results concerning uniform distribution, discrepancy,
and numerical integration [8]. We start with the following inequality

$$\left| \int_{[0,1]^n} F(x)\, dx - \frac{1}{N} \sum_{i=1}^{N} F(x_i) \right| \le D_N V(F),$$

where $D_N$ denotes the discrepancy of the sequence $x_1, x_2, \ldots, x_N$ and $V(F)$ is
the total variation of the function $F \in BV$ (which is assumed to be of bounded
variation in the sense of Hardy and Krause [8]). Note that $x \in \mathbb{R}^n$, as well as
each $x_i$. Standard results in the estimation of the discrepancy $D_N$ show that

$$D_N = O\!\left( \frac{\log^{n-1} N}{N} \right)$$

if the sequence $x_i$ is a good lattice set. It is possible to show that the variation in
the sense of Hardy and Krause of a product of two functions $V(fg)$ is bounded
by an expression of the form

$$V(fg) \le \|f\|_\infty V(g) + \|g\|_\infty V(f) + O(\|f\|_\infty \|g\|_\infty).$$


Letting $F(x) = f(x) g(t - x, \sigma)$ and

$$s_N(x) = \frac{1}{N} \sum_{i=1}^{N} f(x_i)\, g(x - x_i, \sigma)$$

leads to a bound of the form

$$|f_\sigma(x) - s_N(x)| = O\!\left( \frac{\log^{n-1} N}{N} \right).$$

The constant hidden by the $O$ notation depends on $\sigma$, but not on $N$. The total
error satisfies

$$\|f - s_N\| \le \|f - f_\sigma\| + \|f_\sigma - s_N\|.$$

The first term is a function of $\sigma$ but not of $N$, and can be studied by well known
methods (the precise bound depends on the regularity of $f$). The second term
is $O(\log^{n-1} N / N)$, and predicts a degradation of performance as the dimension
$n$ of the space increases. The bound given in the interesting work of Ritter [11]
is $O(1/\sqrt{N})$ (for sigmoidal networks). This is weaker for sufficiently large $N$,
but better for smaller $N$.
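The role of a good lattice set in the integration inequality can be illustrated in two dimensions with a Fibonacci lattice, a classical example of such a point set. The sketch below is our own illustration, not the paper's construction: the integrand exp(-(x^2 + y^2)) is a hypothetical stand-in for the product f(x) g(t - x, sigma), and the function names are assumptions. It compares the equal-weight lattice average with the exact integral over the unit square.

```python
import math


def fibonacci_lattice(m):
    """Points (i / N, frac(i * F_{m-1} / N)) for i = 0..N-1, with N = F_m."""
    a, b = 1, 1  # F_1, F_2
    for _ in range(m - 2):
        a, b = b, a + b
    n_points, generator = b, a  # F_m and F_{m-1}
    return [(i / n_points, (i * generator % n_points) / n_points)
            for i in range(n_points)]


def integrand(x, y):
    """Smooth stand-in for F(x) in the integration inequality (an assumption)."""
    return math.exp(-(x * x + y * y))


def lattice_estimate(m):
    """Equal-weight average of the integrand over the Fibonacci lattice."""
    pts = fibonacci_lattice(m)
    return sum(integrand(x, y) for x, y in pts) / len(pts)


# Exact value: (integral of exp(-x^2) over [0, 1]) squared.
EXACT = (0.5 * math.sqrt(math.pi) * math.erf(1.0)) ** 2

if __name__ == "__main__":
    for m in (10, 13, 16):
        n_points = len(fibonacci_lattice(m))
        print(n_points, abs(lattice_estimate(m) - EXACT))
```

The same node set serves for every evaluation point t, since only the integrand changes with t; this is why the hidden constant in the bound depends on sigma but not on N.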
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Neural  Networks  for  Medical  Image  Processing 

David  G.  Brown,  PhD 


Division  of  Electronics  and  Computer  Science 
Center  for  Devices  and  Radiological  Health 
Rockville,  MD 


The  past  two  decades  have  witnessed  an  explosion  of  medical  imaging 
technologies.  The  invention  of  computed  tomography  served  as  a  vital  bridge 
between  our  analog,  plain-film  past  and  our  increasingly  digital  present, 
endowed  with  a  multiplicity  of  sophisticated  new  imaging  modalities.  These 
new  systems  are  exemplified  by  the  complex  world  of  magnetic  resonance 
imaging,  but  include  many  other  types  of  systems  as  well.  In  addition  to  use  as 
components  for  the  functioning  of  these  imaging  systems,  neural  networks  are 
increasingly  being  used  for  image  processing  tasks  such  as  segmentation  and 
identification of anatomic or functional structures. Computer-aided diagnosis
(CADx) is becoming increasingly important as data become available in
digital form, and in overwhelming quantities, frequently from three- and
four-dimensional (time-resolved) imaging data sets. Neural
networks  are  being  used  in  commercially  available  as  well  as  research  CADx 
systems.  In  addition,  they  are  being  applied  to  such  problems  as  image 
registration  for  multiple  single-modality  or  multiple  modality  images.  Neural 
networks  have  demonstrated  a  growing  importance  in  this  dynamic  field  of 
modern  medicine. 
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Mixture  of  Discriminative  Learning  Experts  of 
Constant  Sensitivity  for  Automated  Cytology 
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Jenq-Neng  Hwang,  Eugene  Lin 

Information  Processing  Laboratory 
Dept. of Electrical Engineering, Box 352500
University  of  Washington,  Seattle,  WA  98195 


Abstract 

One  practical  objective  in  an  automated  cytology  screening 
task  is  to  obtain  as  high  as  possible  specificity  (the  percentage  of 
normal  slides  being  classified  as  normal)  while  attaining  accept¬ 
able  (predefined)  constant  sensitivity.  In  this  paper,  we  propose 
a  new  learning  algorithm  which  continuously  improves  the  speci¬ 
ficity  while  maintaining  constant  sensitivity  for  pattern  classifi¬ 
cation  problems.  We  further  propose  to  integrate  the  pre-trained 
networks  with  constant  sensitivities  into  the  mixture  of  experts 
(MOE)  network  configuration.  This  enables  each  trained  expert 
to  be  responsive  to  specific  subregions  of  the  input  spaces  with 
minimum  ambiguity  and  thus  produces  better  performance. 


1  Introduction 

Thanks to recent advances in image processing technologies and classification
algorithms, the automated cytology screening system has
gained commercial interest [5]. In an automated cytology screener,
the object classification module classifies the Pap smeared object as a
"normal" or an "abnormal" slide. Sensitivity, which is defined as the
percentage of abnormal slides being correctly classified as abnormal,
is a very important factor in most automated biomedical applications.
One practical objective in training a neural network for these applications
is to obtain as high as possible specificity (the percentage of
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normal  slides  being  classified  as  normal)  while  attaining  acceptable 
(predefined)  sensitivity. 

The  traditional  mechanism  to  achieve  this  objective  is  to  train  the 
classifiers  first  and  then  generate  the  Receiver  Operating  Characteris¬ 
tics  (ROC)  curve  based  on  different  thresholds.  By  fixing  a  constant 
sensitivity,  the  corresponding  threshold  and  specificity  can  thus  be  de¬ 
termined.  If  the  specificity  is  not  acceptable,  then  the  whole  training 
process,  which  might  involve  a  new  classifier  structure,  needs  to  be 
reinitiated  until  the  acceptable  specificity  is  attained  under  the  prede¬ 
fined  sensitivity  level.  In  reality,  the  inherent  tradeoff  between  sensi¬ 
tivity  and  specificity  prevents  high  specificity  in  case  of  high  sensitivity. 
Therefore,  we  would  like  to  explore  a  neural  network  learning  proce¬ 
dure  that  can  overcome  this  problem  directly  in  the  learning  phase 
without  the  need  of  varying  the  thresholds  after  training  or  reinitiat¬ 
ing  the  learning.  In  this  paper,  we  adopt  the  discriminative  learning 
neural  network  techniques  [4,  3]  for  automated  cytology  screening  and 
propose  a  new  learning  procedure  which  can  obtain  high  specificity 
while  maintaining  an  acceptable  constant  sensitivity. 
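The traditional mechanism described above (train first, then read an operating point off the ROC curve) can be sketched as follows. The code assumes a classifier that produces one score per slide, with larger scores meaning "more abnormal"; the function names and the toy scores are hypothetical and are not data from the paper.

```python
def sensitivity_specificity(scores, labels, threshold):
    """labels: 1 = abnormal, 0 = normal; a slide is called abnormal if score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def threshold_for_sensitivity(scores, labels, target):
    """Largest candidate threshold whose sensitivity is at least `target`."""
    for t in sorted(set(scores), reverse=True):
        sens, spec = sensitivity_specificity(scores, labels, t)
        if sens >= target:
            return t, sens, spec
    raise ValueError("target sensitivity is not reachable")


if __name__ == "__main__":
    # Toy scores for 4 abnormal and 4 normal slides (illustrative values only).
    scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
    labels = [1, 1, 0, 1, 1, 0, 0, 0]
    print(threshold_for_sensitivity(scores, labels, target=0.75))
```

Scanning thresholds from high to low returns the most specific operating point that still meets the sensitivity target; if the resulting specificity is too low, the only remedy in this scheme is to retrain, which is the drawback the proposed procedure aims to avoid.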

The  discriminative  learning  of  a  feedforward  network  distinguishes 
itself  from  the  traditional  backpropagation  learning  by  having  a  dif¬ 
ferent  cost  function.  The  presence  of  the  discriminative  cost  function 
has  a  profound  impact  on  the  learning  capability  and  performance  of 
the  network,  and  usually  results  in  better  performance.  Built  upon  the 
discriminative  learning  algorithm,  we  propose  the  constant  sensitivity 
learning  procedure  which  continuously  improves  the  specificity  while 
maintaining  constant  sensitivity  for  pattern  classification  problems. 

Mixture of experts (MOE) learning [2] has been shown to provide
better performance due to its ability to effectively solve a large
complicated task by smaller and modularized trainable networks (i.e.,
experts), whose solutions are dynamically integrated into a coherent
one using the trainable gating network. One of the major concerns
in using the MOE is the lack of a meaningful interpretation of each
trained expert when all the expert networks and the gating network
are trained simultaneously. This concern is further amplified when
using the MOE for medical diagnosis applications, where each reasoning
step leading toward the final decision needs to be self-explanatory.
It is highly desirable for each trained expert to be responsive to
specific (non-overlapping) subregions of the input space so that the
gating network can unambiguously identify and integrate the correct
solution.  In  this  paper,  we  thus  propose  to  pre- train  each  expert  net- 
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work  which  operates  in  a  fixed  sensitivity  accomplished  by  the  modified 
discriminative  learning  strategy.  This  enables  the  MOE  network  to  sys¬ 
tematically  identify  a  good  operating  point  in  the  Receiver  Operating 
Characteristics  (ROC)  curve  and  thus  produces  better  performance. 
This approach is based on the principle of divide-and-conquer, such
that each expert network can represent some distinct subregions of the
input space with the minimum amount of overlap, so that the gating
network's output probabilities associated with those subregions can be
quite distinct, i.e., a clear winner can be easily identified.

We  applied  the  proposed  constant  sensitivity  procedure  to  auto¬ 
mated  cytology  screening  tasks.  The  proposed  discriminative  training 
with  constant  sensitivity  outperforms  the  traditional  backpropagation 
learning  and  the  discriminative  learning  without  enforcing  the  con¬ 
stant  sensitivity.  Furthermore,  the  proposed  mixture  of  discriminative 
learning  experts  of  constant  sensitivity  (MDLECS)  architecture  fur¬ 
ther  improves  the  performance  and  outperforms  the  standard  MOE 
techniques. 

2  Discriminative  Learning 

The discriminative learning [4, 3] was proposed specifically for pattern
recognition problems, aiming at achieving a minimum classification
error rate. Based on a given set of training samples, the objective
criterion is defined by the classification rule in a functional form and
is optimized by numerical search algorithms. Under the backpropagation
learning framework, the discriminant functions, $\{f_i(x; W), i =
1, 2, \ldots, M\}$, which are the neural network outputs and indicate the
classification posterior probabilities $P(i|x)$ [6], are first calculated, where
$W$ denotes the parameter set of the classifier (i.e., the feedforward
network weights $\{w_{ij}(l)\}$ of the $l$-th layer) and the training sample $x$
is known to belong to one of $M$ classes. For each input $x$, the classifier
makes its decision by choosing the largest of the discriminants evaluated
on $x$. A misclassification measure for this data is then defined as
follows:

$$d_i(x) = -f_i(x; W) + \left[ \frac{1}{M-1} \sum_{j \ne i} f_j(x; W)^{\eta} \right]^{1/\eta}, \qquad (1)$$

where $\eta$ is a positive number.
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Finally,  the  minimum  error  objective  is  formulated  and  is  expressed 
as a differentiable function of the misclassification measure. More
specifically, the error objective function $E_k$ of the $k$-th class is defined
as

$$E_k(x; W) = E_k(d_k(x)) = \frac{1}{1 + e^{-\tau d_k(x)}}, \qquad \tau > 0. \qquad (2)$$

Note that a positive $d_k(x)$ leads to a penalty which is a count of classification
error, while a negative $d_k(x)$ implies a correct classification.
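A small sketch may make the two quantities of this section concrete. It assumes that the misclassification measure is the usual power-mean form d_i = -f_i + [(1/(M-1)) * sum over j != i of f_j^eta]^(1/eta), and that the sigmoid error objective is 1 / (1 + exp(-tau * d)); both are reconstructions from the discriminative learning literature rather than verbatim copies of Eq. (1) and Eq. (2), and the function names and example outputs are hypothetical.

```python
import math


def misclassification_measure(f, i, eta):
    """d_i for nonnegative network outputs f, true class index i, and eta > 0."""
    m = len(f)
    others = [f[j] for j in range(m) if j != i]
    mean_power = sum(v ** eta for v in others) / (m - 1)
    return -f[i] + mean_power ** (1.0 / eta)


def sigmoid_loss(d, tau=1.0):
    """Smooth, differentiable surrogate for the 0/1 error count (assumed form)."""
    return 1.0 / (1.0 + math.exp(-tau * d))


if __name__ == "__main__":
    outputs = [0.7, 0.2, 0.1]  # hypothetical posterior estimates for M = 3 classes
    for true_class in range(3):
        d = misclassification_measure(outputs, true_class, eta=50.0)
        # d < 0 when the true class has the largest output, d > 0 otherwise.
        print(true_class, d, sigmoid_loss(d, tau=10.0))
```

For large eta the bracketed term approaches the largest competing output, so the sign of d_i tells whether the sample is classified correctly, and a large tau makes the sigmoid approach a count of errors.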

3 Constant Sensitivity Discriminative Learning

To achieve high specificity while maintaining constant sensitivity, we
propose a new procedure built upon the discriminative learning algorithm
on a feedforward neural network [7], i.e., a 2-layer perceptron.
The training sample $x$ is known to belong to one of 2 classes: the abnormal
slides as Class 1 and the normal slides as Class 2. We also define a
constant $\lambda_0$ which is the sensitivity value to be maintained during the
optimization process. In this 2-class application, as $\eta$ approaches $\infty$
in Eq. (1), it results in the simple misclassification measure $d_k(x)$:

$$d_1(x) = -f_1 + f_2, \qquad d_2(x) = -f_2 + f_1. \qquad (3)$$

Instead of using the (sigmoid) error objective function $E_k$ of Eq.
(2), we also tried another error objective function:

$$E_k(x; W) = E_k(d_k(x)) = c_1 * d_k^{\tau} + c_2 * d_k^{\tau - 1}, \qquad (4)$$

where $c_1 > 0$, $c_2 > 0$, and $\tau > 0$. In our simulation, $c_1 = 1$, $c_2 = 2$,
and $\tau = 2$ are used, that is,

$$E_k(x; W) = E_k(d_k(x)) = d_k^2 + 2 * d_k. \qquad (5)$$
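The alternative objective is easy to inspect numerically. The sketch below assumes the reading E(d) = c1 * d^tau + c2 * d^(tau - 1) of Eq. (4), so that c1 = 1, c2 = 2, tau = 2 gives d^2 + 2d as in Eq. (5); it further assumes network outputs in [0, 1], so that the two-class measure d lies in [-1, 1]. The function names are hypothetical.

```python
def poly_loss(d, c1=1.0, c2=2.0, tau=2):
    """E(d) = c1 * d**tau + c2 * d**(tau - 1); the defaults give d**2 + 2*d."""
    return c1 * d ** tau + c2 * d ** (tau - 1)


def poly_loss_grad(d, c1=1.0, c2=2.0, tau=2):
    """dE/dd, the factor that multiplies dd/dW in a gradient descent update."""
    return c1 * tau * d ** (tau - 1) + c2 * (tau - 1) * d ** (tau - 2)


if __name__ == "__main__":
    for d in (-1.0, -0.5, 0.0, 0.5, 1.0):
        print(d, poly_loss(d), poly_loss_grad(d))
```

With the default constants the derivative is 2(d + 1), which is nonnegative on [-1, 1]: the loss increases monotonically with d there, and, unlike the sigmoid, its slope does not vanish for badly misclassified samples with d close to 1.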

The optimization process, which increases the specificity over all
the normal training data while maintaining constant sensitivity $\lambda_0$
over all the abnormal training data, uses the simple iterative gradient
descent search algorithm by separately dealing with $E_1$ and $E_2$.
A "batch" training procedure is used, i.e., the weights are not updated
until the weight changes of all training samples in one iteration
have been accumulated. The calculation of sensitivity is always based on the
threshold of 0.5.
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1.  For  each  training  input  {x^")}  in  Class  1  (abnormal): 


f  i\ 


^abnormal  ^  -^0  (6) 


^abnormal  -^0  C*^) 


where $\lambda_{abnormal}$ denotes the sensitivity value evaluated over all
abnormal training samples at the previous iteration.

2.  For  each  training  input  in  Class  2  (normal): 

Q  P 


In case of oscillation of $\lambda_{abnormal}$ around $\lambda_0$, the updating steps, $\alpha_1$
and $\alpha_2$, are gradually decreased.


4  Mixture  of  Experts 

To achieve the goal of a meaningful interpretation of experts in a mixture
of experts (MOE) network, and also to have each trained expert
responsive to non-overlapping subregions of the input space, we
further integrate the proposed constant sensitivity learning algorithm
into a mixture of experts architecture [2], where each expert network
is pre-trained with a different constant sensitivity.

For a given input $x$, the total probability of generating class $y$ from
$x$ based on a $K$-expert MOE is computed by:

$$P(y | x, \phi) = \sum_{i=1}^{K} g_i P(y | x, \theta_i), \qquad (9)$$

where $y$ is a binary vector, e.g., $y$ is either $[1, 0]$ or $[0, 1]$ for a 2-class
problem, $\phi$ is the set of parameters associated with the gating and the
expert networks, $\{g_i\}$ are the gating probabilities for weighting the
expert outputs $\{P(y | x, \theta_i)\}$, and $\{\theta_i\}$ are the parameters for the
$i$-th expert network $(i = 1, \ldots, K)$.

The  gating  network  can  be  a  nonlinear  neural  network  or  a  lin¬ 
ear  combiner.  To  obtain  the  gating  network  probabilistic  outputs,  the 
softmax  function  is  adopted  [1].  The  learning  algorithm  for  the  MOE 
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is  based  on  the  maximum  likelihood  principle  to  estimate  the  param¬ 
eters,  the  synaptic  weights  associated  with  the  gating  and  the  expert 
networks. In this training procedure, the resulting threshold is always
set to 0.5. The gradient ascent algorithm can be used to estimate the
parameters. Since the expert networks have been pre-trained by the
constant sensitivity discriminative learning algorithm, the parameters
of the expert networks remain unchanged during the training
of the MOE. Note that, in this mixture of discriminative
learning experts of constant sensitivity (MDLECS) configuration, the
overall sensitivity value cannot be fixed in advance (it lies within the range
between the minimum and maximum sensitivities provided by the expert
networks). An obvious advantage of this architecture is that the training
does not over-generalize, as always happens when far fewer
abnormal training data are used and the trained network is
pushed to very high specificity with very low sensitivity by standard
network training algorithms.
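The combination rule of Eq. (9) with softmax gating can be sketched in a few lines. The code takes the gating network's pre-softmax outputs and each expert's class distribution as given; how those are produced (linear or nonlinear gating, pre-trained experts) is outside the sketch, and all names and example numbers are hypothetical.

```python
import math


def softmax(z):
    """Numerically stable softmax; returns gating probabilities that sum to 1."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def moe_predict(gate_logits, expert_probs):
    """P(y | x) = sum_i g_i * P(y | x, theta_i).

    gate_logits[i] is the gating network's raw output for expert i, and
    expert_probs[i] is expert i's class distribution for the same input x.
    """
    g = softmax(gate_logits)
    n_classes = len(expert_probs[0])
    return [sum(g[i] * expert_probs[i][c] for i in range(len(g)))
            for c in range(n_classes)]


if __name__ == "__main__":
    # Two frozen experts that disagree; the gate strongly prefers the first.
    print(moe_predict([4.0, 0.0], [[0.9, 0.1], [0.2, 0.8]]))
```

Because the experts are frozen after pre-training, only the quantities feeding gate_logits are adjusted when the MOE is trained, and a sharply peaked softmax output corresponds to the "clear winner" situation the architecture aims for.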

5  Comparative  Simulation  Results 

To  verify  the  feasibility  and  superiority  of  the  proposed  constant  sensi¬ 
tivity  discriminative  learning  procedure  and  its  integration  with  MOE, 
we  utilized  the  real  world  cytology  data  provided  by  NeoPath  Inc.,  who 
developed  and  manufactured  the  world’s  first  automated  Pap  Smear 
screening  system  -  the  AutoPap  300.  There  are  32  features  extracted 
from  the  Pap  Smear  images  obtained  from  a  single  slide.  All  these  fea¬ 
tures  were  normalized  to  have  zero  mean  and  unit  standard  deviation 
before input to the classifiers. Among these data, 5000 slide samples
(2880 "abnormal", 2120 "normal") are used as the training set, and
another 5000 slide samples (2880 "abnormal", 2120 "normal") are used
as the independent testing set.

The overall accuracy, i.e., classification rate over the testing data, is
used to evaluate the performance of classifiers. Table 1 shows the overall
accuracy on the 5000 testing data under different sensitivity values,
i.e., $\lambda_0$ equal to 0.75, 0.8, 0.85, 0.9, and 0.95. Three learning procedures
are compared experimentally: the backpropagation
learning, the discriminative learning, and the constant sensitivity
discriminative learning. It appears that the constant
sensitivity discriminative learning outperforms the other two learning
methods. Also note that for the constant sensitivity discriminative
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learning,  the  use  of  the  error  objective  function  in  Eq.  (4)  provides 
better accuracy than that in Eq. (2). All these simulations use a
one-hidden-layer feedforward neural network with the same number of hidden
units.

Based on the five networks pre-trained by constant sensitivity
discriminative learning, we further built a mixture of experts with
a linear gating network. Each expert network is a one-hidden-layer
feedforward neural network with 10 hidden units, and the gating
network is a single-layer feedforward neural network with softmax
outputs. The overall accuracy of the mixture of experts is 78.54%
with an overall sensitivity of 0.82. The performance improved
further to 79.04% (with a sensitivity of 0.81) when a nonlinear
gating network was used (i.e., a 2-layer feedforward perceptron with
20 hidden units). These results compare favorably with those of a
2-layer feedforward perceptron with 50 hidden units trained by
discriminative learning at constant sensitivity (fixed at 0.80),
which reaches 77.56% with a testing sensitivity of 0.77 and has a
number of parameters similar to that of the proposed MDLECS (we also
tried a 70-unit 2-layer perceptron; its performance is almost the
same as that of the 50-unit one). Note in particular that a higher
sensitivity is consistently achieved by the proposed mixture of
discriminative learning experts of constant sensitivity (MDLECS).
For a fair comparison, we also trained a standard MOE network
without pre-trained constant sensitivity experts. Each of the five
expert networks has 10 hidden neurons and the nonlinear gating
network contains 20 hidden neurons; the expert and gating networks
are trained simultaneously by a gradient ascent algorithm. The
resulting accuracy is 78.18% (with a sensitivity of 0.79).
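
The gating step described above can be illustrated with a minimal
sketch. It assumes, for illustration only, that each expert returns a
vector of class scores and that the gating network produces one raw
score per expert, which a softmax turns into gating probabilities. The
function names and all numbers are hypothetical and are not taken from
the experiments.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw gating scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_output(expert_outputs, gating_scores):
    """Blend expert outputs with softmax gating probabilities.

    expert_outputs: one list of class scores per expert.
    gating_scores:  one raw gating-network score per expert.
    Returns the gating probabilities and the blended class scores.
    """
    gates = softmax(gating_scores)
    n_classes = len(expert_outputs[0])
    blended = [
        sum(g * out[c] for g, out in zip(gates, expert_outputs))
        for c in range(n_classes)
    ]
    return gates, blended


# Hypothetical example: five experts, two classes.
experts = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
gates, blended = mixture_output(experts, [4.0, 0.0, 0.0, 0.0, 0.0])
```

When one gating score dominates, as in this example, the blended output
is close to the output of the corresponding expert, which is the
"one dominating probability" situation discussed for Table 3.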

By observing the gating network output probabilities associated
with the 5 experts, we are able to find some clues to explain the
better performance of the proposed MDLECS. Table 3 shows 12
representative output probabilities of the gating network trained
with the standard MOE procedure. Note that the MOE did a good job of
separating the input space into subregions, so only one dominating
probability exists in most cases. The proposed MDLECS produced an
even more clear-cut partition of the input subregions, so the single
dominating probability is even more pronounced for the same set of
testing data, as evidenced by the bold face numbers in Table 3.
Sensitivity A                      0.75    0.8     0.85    0.9     0.95

Backpropagation                    75.70   75.88   73.32   70.38   65.56
Discriminative                     76.68   76.62   74.82   71.88   66.46
Constant Sensitivity with Eq. (2)  76.92   77.16   76.26   74.68   69.50
Constant Sensitivity with Eq. (4)  77.40   78.16   77.78   75.50   71.28

Table 1: Testing accuracy (% correct) for the comparative
simulations among the three learning procedures.


Methods                                   Testing Accuracy   Sensitivity

Fixed Constant Sensitivity BP (32-50-2)   77.56%             0.80
Standard MOE                              78.18%             0.79
MDLECS (Linear Gating)                    78.54%             0.82
MDLECS (Nonlinear Gating)                 79.04%             0.81

Table 2: The comparative testing accuracy (% correct) and the
resulting sensitivity for various mixture of experts configurations.
MOE        g1        g2        g3        g4        g5

Data 1     0.23463   0.00089   0.75585   0.00420   0.00441
Data 2     0.56412   0.06864   0.35065   0.01592   0.00064
Data 3     0.91152   0.01505   0.06893   0.00423   0.00023
Data 4     0.06379   0.01670   0.04422   0.84404   0.03123
Data 5     0.64777   0.20182   0.04258   [...]
Data 6     0.78372   0.00229   0.20898   0.00474   0.00024
Data 7     0.56004   0.00370   0.42796   0.00713   0.00115
Data 8     0.15883   0.63616   0.06820   [...]
Data 9     0.41742   0.00432   0.57676   0.00120   0.00027
Data 10    0.00380   0.46015   [...]
Data 11    0.15195   0.30548   0.05661   0.48406   0.00188
Data 12    0.87367   0.00080   0.12464   0.00082   [...]

MDLECS     g1        g2        g3        g4        g5

Data 1     0.94406   0.00002   0.00017   0.00008   0.05564
Data 2     0.02298   0.00167   0.04568   0.00610   0.92354
Data 3     0.01396   0.00158   0.04408   0.00779   0.93257
Data 4     0.37157   0.61927   0.00115   0.00529   0.00269
Data 5     0.96350   0.02500   0.00029   0.00278   [...]
Data 6     0.72072   0.00299   0.00796   0.26785   [...]
Data 7     0.88323   0.00381   0.00403   [...]
Data 8     0.05907   0.00004   0.00001   0.94084   0.00002
Data 9     0.99580   0.00332   0.00007   0.00008   0.00071
Data 10    0.95901   0.04066   0.00001   0.00000   0.00030
Data 11    0.77697   0.04275   0.15508   0.00580   0.01938
Data 12    0.86628   0.00741   0.00025   0.00000   0.12604

Table 3: The output probabilities of the gating network for 12
representative testing data trained with the standard MOE and MDLECS
procedures.
6  Conclusion


In  this  paper,  we  present  a  new  learning  procedure  built  upon  the  dis¬ 
criminative  learning  algorithm  to  achieve  high  accuracy  while  main¬ 
taining  constant  sensitivity.  We  further  integrate  these  trained  con¬ 
stant  sensitivity  networks  into  the  mixture  of  experts  (MOE)  config¬ 
uration,  which  results  in  very  encouraging  simulation  performance  in 
automated  cytology  screening  applications. 
Abstract

Two different measurement modalities, one related to blood
flow and the other related to brain metabolism, are monitored
in a head injury patient and analyzed by using the method of
surrogate data. The method is applied against a hierarchy of
two-dimensional Markov processes, designed to model a possible
deterministic behaviour of the system and correlations between
the two observed variables. Two-layered feed-forward neural
networks are trained to estimate the two-dimensional conditional
densities of the proposed Markov models. A cumulant based
information flow is used for testing the observed dynamics
against the hierarchy of null hypotheses. Deterministic dynamics
corresponding to a low order Markov process were found in both
time series. In addition, some correlation was detected,
indicating a coupling of the blood flow and the metabolism
related parameters that depends on the patient's condition. The
proposed method could be a useful tool for detecting malfunction
in the regulation of the basic human metabolism and for
predicting its evolution within a reasonable time window.

1.  MONITORING  OXYGEN  METABOLISM  IN  THE  BRAIN 

The measurement of the brain metabolism and related parameters is becoming
an increasingly helpful tool for the follow-up of people with serious
trauma of the central nervous system.

In a head injury, the space-occupying lesions are first removed surgically,
whenever necessary. Then the brain edema is treated non-surgically by
hyperventilation, sedation, and osmotic agents. Specifically, the
conservative treatment is guided by various parameters which are recorded
on line in the Intensive Care Unit and give hints of the metabolic state of
the brain. However, treatment guided by these parameters is reactive:
whenever the parameters indicate a deterioration, treatment is adjusted.
It is highly desirable to know about those episodes of deterioration in
advance, since the treatment could then be more appropriate and avoid
crises. Quite often, very similar changes in the parameter trends can lead
to very different consequences. It would be useful to evaluate whether the
time course of such variables in head injured patients shows any
deterministic structure and, if so, whether it is possible to characterize
those clinical situations.

In this work we analyze time series of data coming from two devices
during different coma conditions of the same patient. One parameter is the
local partial oxygen pressure of frontal white matter of the brain (tipO2).
This parameter is more related to blood flow and diffusion dependent oxygen
delivery. The second parameter defines the oxygen loading of hemoglobin in
venous and arterial blood, and is more related to metabolism (HgBO2).

We  want  to  analyse  two  issues  in  this  study.  First  of  all  the  Markovian 
character  of  the  underlying  process  is  investigated,  in  order  to  model  the 
system  structure  and  to  perform  some  prediction  on  the  future  values.  Then 
a  possible  correlation  of  the  two  signals  is  studied.  A  strong  correlation  would 
indicate  an  intact  regulation  of  the  brain  metabolism,  whereas  no  correlation 
would  indicate  a  serious  disruption  of  this  regulatory  mechanism. 

A statistical approach for detecting the Markovian character of dynamical
systems by analyzing their information flow is applied here to the two
signals (tipO2 and HgBO2), alone or in combination, in order to detect
deterministic behaviour and/or correlation.

A measure based on higher order cumulants depending on the past values
of both time series is calculated. It quantifies the statistical
dependences of a point r steps ahead in one of the two time series on their
past values. These cumulant based information flows, expressed as a
function of the lookahead r, are used as a discriminating statistic in
testing the observed dynamics against a hierarchy of null hypotheses,
corresponding to two-dimensional nonlinear Markov processes of increasing
order [1]. The process can be iterated, generating Markov models of higher
and higher order, corresponding to better approximations of the underlying
two-dimensional process in terms of information flow.

Two-layered  feed-forward  neural  networks  are  trained  to  estimate  the  con¬ 
ditional  densities  of  each  Markov  process. 

Few studies have been performed in this area so far, since sophisticated
measurement systems have only recently become available. Previous studies
analyzed the case from a strictly medical point of view [6] or checked for
chaotic behaviour [5].

Some  deterministic  dynamics,  corresponding  to  a  low  order  Markov  pro¬ 
cess,  was  detected  in  both  time  series.  A  strong  correlation  was  also  detected 
in  normal  conditions  of  the  patient,  while  one-dimensional  Markov  models 
were  sufficient  to  describe  both  information  flows  when  the  patient  was  in 
worse  clinical  conditions. 
2. TESTING NONLINEAR MARKOVIAN BIVARIATE HYPOTHESIS

Four cumulant based measures of the statistical dependences of the points
$r$ steps ahead in the time series $x_t$ and $y_t$ are derived based on
$n_x$ and $n_y$ succeeding observations of the two time series. They
indirectly describe the loss of information in the underlying dynamical
system. Such measures are used here as discriminating statistics to accept
or reject a null hypothesis consisting of a Markov model supposed to be
adequate to explain the system.

Let us define the $(n_x + n_y + 1)$-dimensional vector

$$V_d = (x_{t-n_x}, \ldots, x_{t-1}, y_{t-n_y}, \ldots, y_{t-1}, d_{t+r}) \tag{1}$$

where $d_{t+r}$ represents alternatively $x_{t+r}$ and $y_{t+r}$; then the
general equation of those measures can be defined as:

$$m_d(r) = \sum_{j=1}^{\infty} \; \sum_{i_1, \ldots, i_j = L_d}^{U_d} \left( K^{d}_{i_1, \ldots, i_j, n_x + n_y + 1} \right)^2 \tag{2}$$

where the $K^{d}_{i_1, \ldots, i_j}$ represent the cumulants of order $j$
calculated on the vector $V_d$ and $d$ refers to the time series $x_t$ or
$y_t$. The dependency of $x_{t+r}$ on the past values of $x_t$ is measured
by $m_{xx}(r)$, calculated by using the vector $V_{d=x}$ with $L_d = 1$ and
$U_d = n_x$. Extending the calculation to the following $n_y$ values in the
vector $V_x$, a measure $m_x(r)$ of the dependency of $x_{t+r}$ on the past
of both time series is obtained. In this case $U_d = n_x + n_y$. Similar
measures can be defined for the time series $y_t$. A measure $m_{yy}(r)$ of
the dependency of $y_{t+r}$ on its past is produced by using
$d_{t+r} = y_{t+r}$ in the vector $V_{d=y}$ with $L_d = n_x$ and
$U_d = n_x + n_y$ in eq. (2). Extending the sum to the previous $n_x$
values in the vector $V_y$, that is, setting $L_d = 1$, a measure $m_y(r)$
of the dependency of $y_{t+r}$ on the past of both time series is obtained.

These four measures, $m_x(r)$, $m_{xx}(r)$, $m_y(r)$, and $m_{yy}(r)$, are
a cumulant based characterization of the information flow of the
two-dimensional underlying system and measure the dependences of $x_{t+r}$
and $y_{t+r}$, respectively, on their $n_x$ and $n_y$ past values or on
both. They can also be seen as measures of the dependence of $x_t$ on $y_t$
and vice versa. The first sum in eq. (2) is approximated with a finite
number of terms, and the cumulants are calculated up to the fourth order
[1].

The method of surrogate data [2] is applied separately to the two time
series. We define a null hypothesis as a two-dimensional Markov model
supposed to be an adequate explanation of the underlying system. A
discriminating statistic involving the cumulant based measures $m_x(r)$,
$m_{xx}(r)$, $m_y(r)$, and $m_{yy}(r)$ allows us to quantify the
consistency of our null hypothesis with the properties of the original
system. The two-sample Student t test [1] is adopted as the discriminating
statistic, expressed by the following variable (3), which has a Student t
distribution with $M - 1$ degrees of freedom, $M$ being the number of
surrogate data sets used:


where $d$ represents the process $x_t$ or $y_t$, $d_i$ indicates the
corresponding surrogate time series, and $r$ the lookahead. The null
hypothesis is accepted if the absolute value $|t_d(r)|$ is smaller than the
value of $t$ corresponding to a chosen p-value, which depends on $M$ [1],
for lookaheads $1 \le r \le 6$ and for all four cumulant based measures. In
this case our assumption about the original time series is adequate. A
hierarchy of null hypotheses is defined by increasing the order of the
Markov model whenever the null hypothesis is rejected.
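
The acceptance test can be sketched as follows. This is only an
illustration under a stated assumption: the statistic is taken to be the
textbook two-sample Student t statistic in which the original series
contributes a single value and the M surrogates form the second sample,
which gives M - 1 degrees of freedom. The function names are hypothetical,
and the critical value is assumed to be supplied by the user from a t
table.

```python
import math


def surrogate_t(m_original, m_surrogates):
    """Two-sample Student t statistic with one sample of size one.

    m_original:   measure of the original series at one lookahead.
    m_surrogates: the same measure for each of the M surrogate series
                  (assumed not to be all identical).
    """
    M = len(m_surrogates)
    mu = sum(m_surrogates) / M
    var = sum((m - mu) ** 2 for m in m_surrogates) / (M - 1)
    return math.sqrt(M / (M + 1)) * (m_original - mu) / math.sqrt(var)


def accept_null(t_values, t_critical):
    """Accept only if |t| stays below the critical value at every lookahead."""
    return all(abs(t) < t_critical for t in t_values)
```

In this sketch a Markov order would be kept only if `accept_null` holds
for every one of the four measures over all tested lookaheads.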

Before beginning the analysis, the two time series are independently
gaussianized to avoid possible static nonlinearities due to the measurement
process [2]. For that purpose, Gaussian random numbers are generated
and re-ordered so that the ranks of both sequences, the original and the
gaussianized one, agree. The next step is to determine the order
$\{n_x, n_y\}$ of the two-dimensional nonlinear Markov process which is
supposed to approximate the information flow of the transformed sequences.
If we do not have detailed knowledge about the observed dynamical system,
we start with $n_x = 1$, $n_y = 0$. We do not deal with the case $n_x = 0$,
$n_y = 0$ (white noise in both signals), since physicians already find some
information in those signals.
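
The rank-ordering step can be illustrated with a short sketch. It assumes
standard normal draws and a fixed random seed for reproducibility; the
function name `gaussianize` is chosen here for illustration.

```python
import random


def gaussianize(series, seed=0):
    """Replace each value by a Gaussian draw of the same rank.

    The k-th smallest value of the input is mapped to the k-th smallest
    of len(series) standard normal random numbers, so the ranks of the
    original and the gaussianized sequence agree.
    """
    rng = random.Random(seed)
    draws = sorted(rng.gauss(0.0, 1.0) for _ in series)
    # order[k] is the position of the k-th smallest original value
    order = sorted(range(len(series)), key=lambda i: series[i])
    out = [0.0] * len(series)
    for rank, idx in enumerate(order):
        out[idx] = draws[rank]
    return out


# Hypothetical measurements; only their ordering matters.
example = gaussianize([3.0, 1.0, 2.0, 10.0, -4.0])
```

Because only ranks are used, any monotonic distortion introduced by the
measurement device leaves the gaussianized sequence unchanged.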

Two-layered feed-forward neural networks are trained to estimate the
conditional densities
$p(x_{t+r} \mid x_t, \ldots, x_{t-n_x}, y_t, \ldots, y_{t-n_y})$ and
$p(y_{t+r} \mid x_t, \ldots, x_{t-n_x}, y_t, \ldots, y_{t-n_y})$ of the
proposed Markov model.

In order to provide the networks with the minimum number of significant
inputs, a time delay is applied to the original time series. The first
minimum of the mutual information, calculated as in [7], is adopted as the
best available systematic criterion for choosing time delays.
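
A minimal sketch of this delay selection is given below. It assumes a
simple equal-width histogram estimate of the mutual information, which is
not necessarily the estimator of [7]; the bin count, the maximum lag, and
the function names are illustrative choices.

```python
import math


def mutual_information(series, lag, bins=8):
    """Histogram estimate (in nats) of I(x_t; x_{t+lag})."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / bins or 1.0  # guard against a constant series

    def cell(v):
        return min(int((v - lo) / width), bins - 1)

    pairs = [(cell(a), cell(b)) for a, b in zip(series, series[lag:])]
    n = len(pairs)
    joint, px, py = {}, {}, {}
    for a, b in pairs:
        joint[(a, b)] = joint.get((a, b), 0) + 1
        px[a] = px.get(a, 0) + 1
        py[b] = py.get(b, 0) + 1
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


def first_local_minimum(values):
    """Index of the first interior local minimum, or None if there is none."""
    for k in range(1, len(values) - 1):
        if values[k] < values[k - 1] and values[k] <= values[k + 1]:
            return k
    return None


def choose_delay(series, max_lag=50, bins=8):
    """Lag of the first minimum of the mutual information, or None."""
    mi = [mutual_information(series, lag, bins) for lag in range(1, max_lag + 1)]
    k = first_local_minimum(mi)
    return None if k is None else k + 1  # mi[k] belongs to lag k + 1
```

With few samples or coarse bins the estimate is noisy, so in practice the
curve is usually inspected before a delay is fixed.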

Since it has been shown that nonlinear neural networks are very suitable
for this purpose [4], we decided to approximate all the density functions
with:

$$\sigma_h^2 = v_{h0} + \sum_{i=1}^{I} v_{hi} \tanh\left( w_{hi0} + \sum_{j=1}^{n_x} w_{hij}\, x_{t-j} + \sum_{z=1}^{n_y} w_{hi(n_x+z)}\, y_{t-z} \right) \tag{8}$$

where $k$ denotes the number of Gaussians, $I$ is the number of hidden
neurons, and the weights $v_{hi}$ and $w_{hij}$ in (8), together with the
corresponding weights of the networks (6) and (7), are the parameters of
the networks. Additionally, the constraint $\sum_{h=1}^{k} u_h = 1$ holds
to ensure that the sum (5) is really a density function.

Each conditional probability density is therefore represented by a weighted
sum of normal distributions, the weights, means, and variances of which are
the outputs of different neural networks (multi-layer perceptrons, (6) - (8))
for the $(n_x + n_y)$-dimensional input. The training is performed
following the maximum likelihood principle [3], in the sense that the
parameters of the networks are updated by gradient descent on the negative
log-likelihood function:

$$-\sum_{t} \log p(d_{t+r};\, u_1, \mu_1, \sigma_1^2, \ldots, u_k, \mu_k, \sigma_k^2). \tag{9}$$
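
The weighted sum of normal distributions and the training objective can
be sketched as follows. The sketch assumes that the mixture weights,
means, and variances for each time step have already been produced by the
networks and are simply passed in as lists; the function names are
illustrative, and no network training is shown.

```python
import math


def mixture_density(d, weights, means, variances):
    """Density of a weighted sum of normal distributions at the point d.

    The weights are assumed to be non-negative and to sum to one.
    """
    return sum(
        u * math.exp(-((d - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
        for u, mu, var in zip(weights, means, variances)
    )


def negative_log_likelihood(targets, params_per_step):
    """Sum over time of -log p(d_t) for per-step mixture parameters.

    params_per_step holds one (weights, means, variances) triple per target.
    """
    return -sum(
        math.log(mixture_density(d, *params))
        for d, params in zip(targets, params_per_step)
    )
```

Gradient descent on this quantity with respect to the network weights is
what the maximum likelihood training amounts to; the gradient computation
itself is omitted here.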

After the training, the networks can build new sequences using the
Markovian approximation of the information flow of the time series $x_t$
and $y_t$, starting from the first $n_x$ values of $x_t$ and the first
$n_y$ values of $y_t$. The new sequences $x_i$ and $y_i$ (a surrogate data
set) are gaussianized to have the same marginal distribution as the
original ones. An arbitrary number $M$ of surrogate data sets are generated
and the measures $m_x(r)$, $m_y(r)$, $m_{xx}(r)$, and $m_{yy}(r)$ are
independently calculated for each one. If the hypothesis that the nonlinear
Markov process of order $\{n_x, n_y\}$ is appropriate to explain the data
is rejected, the order is increased to $n_x + 1$ or $n_y + 1$, and the
procedure is repeated starting with the training of the neural networks.
The hypothesis is finally accepted if both tests are accepted for
lookaheads $1 \le r \le 6$.

Starting with $n_x = 1$ and $n_y = 0$ and gradually increasing the order of
the Markov process makes it possible to detect the appropriate Markov model
of minimum order. Correlations between the two time series are also
detected if the Markov model of minimum order requires both $n_x$ and $n_y$
to be different from 0. The complete method is summarized in Figure 1.

3.  RESULTS 

Two time series of oxygen and metabolism related parameters of the brain
are analyzed here in a comatose patient after head injury. They are
recorded at two different times, corresponding to two different patient
conditions. During time A a clinically relatively stationary condition was
recorded, while during time B the patient was in a critical condition.

The first measurement (tipO2) invasively records the local tissue oxygen
tension in the white matter of the brain. The second technique locally
measures the oxygen loading of hemoglobin (HgBO2) in arterial and venous
blood of the brain. Both techniques adopt a 4s/m sample frequency.
[Figure 1 block label: P(x(t+1) | x(t), ..., x(t-n_x), y(t), ..., y(t-n_y))]

Figure 1: Block diagram for the minimum order model selection.


The  method  of  surrogate  data  is  applied  to  the  two  time  series  against 
a  hierarchy  of  two-dimensional  Markov  processes,  looking  for  correlations 
between  the  two  signals  and  for  deterministic  behaviour. 

In figures 2 and 3, the cumulant based discriminating measures of the
original data, $m_x(r)$, $m_{xx}(r)$, $m_y(r)$, $m_{yy}(r)$ (2), and the
corresponding averaged ones of the surrogate data, $\mu_x(r)$,
$\mu_{xx}(r)$, $\mu_y(r)$, $\mu_{yy}(r)$ (3), for different Markov models
are displayed as functions of the lookahead $r$. The order of the Markov
models is indicated in parentheses. In both cases, it is easy to observe
how the averaged discriminating measures of the surrogate data
progressively approach those of the original data as the Markov order
increases.

Even  though  the  experiments  are  still  in  progress  to  extend  the  analysis 
to  more  patients,  some  deterministic  dynamics,  corresponding  to  a  low  order 
two-dimensional  Markov  process,  was  detected  in  both  time  series  at  both 
times  A  and  B. 

The order {1, 1} is the lowest order of the Markov processes not rejected
by the discriminating analysis of the two time series at time A (Fig. 2).
It is easy to see in figure 2 that the statistical properties of the
surrogate data, represented by the variables $\mu_x(r)$, $\mu_y(r)$,
$\mu_{xx}(r)$, and $\mu_{yy}(r)$, fit those of the original time series. In
addition, while the tipO2 is well modelled by a one-dimensional Markov
model of order {1, 0} (Fig. 2a), the minimum order Markov model needs some
past values of the tipO2 in order to represent the HgBO2 too. That shows a
strong correlation between the two time series, due to a coupling of the
two parameters that indicates an intact regulation.

For the time series referring to time B, two different one-dimensional
Markov models are detected, of order {2, 0} for the tipO2 and {0, 1} for
the HgBO2, respectively (Fig. 3). That describes two statistically
independent processes. In this case the coupling of blood flow and
metabolism is disrupted, indicating a serious disturbance of the regulatory
mechanisms.

We  can  see  that  the  combined  model  of  order  {2, 1}  approximates  the 
statistical  behaviour  of  the  two  time  series  even  better.  That  proves  the 
efficiency  of  the  proposed  algorithm,  since  Markov  processes  with  limited 
higher  orders  are  still  able  to  model  the  original  process. 

Both cases, A and B, show a system characterized by low order dynamics,
but the high amount of noise in the measurement process could hide higher
order dynamics. We cannot draw on biological knowledge for modelling the
oxygen metabolism in patients with serious head injury, since the problem
involves too many variables and the biological processes of interest are
often not completely known. In this sense, the method proposed here is a
good tool to investigate and to model the dynamics of the brain
circulation without a priori biological knowledge. Finally, the proposed
algorithm should make it possible to distinguish apparently similar
conditions of the patient and possible pathologies in the brain metabolism.

Advance knowledge of the regulation of the basic metabolism of the
patient would allow appropriate therapies, avoiding ineffective ones.

4.  CONCLUSIONS 

The measurement of the oxygen and metabolism related parameters in the
brain is a very helpful tool in the follow-up of patients with serious
trauma of the central nervous system. Several variables related to oxygen
delivery and metabolism in the brain can be monitored with different
techniques. In addition, forecasting and modelling such variables would be
very important for undertaking an appropriate therapy.

In this work, two different measurement trends related to oxygen
metabolism in the brain are examined in two different clinical situations
of the same patient.

The method of surrogate data is applied to the two time series against
a hierarchy of two-dimensional Markovian processes, looking for correlation
between the two signals and deterministic behaviour. In that way it might
be possible to characterize similar clinical situations with different
follow-up by means of different dynamics.

Two-layered  feed-forward  neural  networks  are  trained  to  perform  the  es¬ 
timation  of  the  two-dimensional  conditional  densities  of  the  Markov  models. 



Deterministic dynamics, corresponding to a low order two-dimensional
Markov process, were found in both time series for both clinical
conditions. A correlation between the two time series, due to the control
action of the autonomic nervous system on the basic human metabolism, was
detected only when the patient was in a better clinical condition.

The proposed method could be a useful tool for modelling and detecting
malfunctions in the regulation of the basic metabolism and for predicting
its evolution within a reasonable time window. This advance knowledge would
allow appropriate therapies, avoiding ineffective ones.
ABSTRACT

In this paper, we investigate the use of a modular architecture of
multiple clustering based pattern classifiers for ECG beat
classification using the MIT/BIH arrhythmia database. The feature
space is divided into several regions and individual classifiers are
developed for each region separately. Then the outputs of these
classifiers are combined using two competing combination rules: a
winner-decides-all method and a distance-based combination
method. Experimental results indicate that the multiple classifier
approach yields better sensitivity and classification rate.

I.  INTRODUCTION 

ECG beat classification [1, 2, 4, 7, 8, 11, 13, 16, 21, 22, 25, 29]
is a difficult pattern classification problem which, despite numerous
previous attempts, has not been solved satisfactorily. The
difficulties stem from many factors, including the large dimension of
the feature space, the large number of training samples, significant
overlap between class boundaries, and the ever changing morphology
(beat shape) with time, just to name a few.

In this study, we report empirical results of performing ECG
beat classification to distinguish normal heart beats from
premature ventricular contraction (PVC) beats using a
multi-classifier pattern classification architecture [3, 5, 6, 9, 10,
12, 15, 19, 23, 24, 26, 27, 28]. In this architecture, multiple,
separately developed pattern classifiers are combined using a mixture
of experts approach. The motivations for studying such an architecture
are twofold:

(a)  Performance:  A  modular  architecture  has  the  potential  to 
yield  better  performance  than  a  monolithic  classifier  architecture. 
This  is  a  well  known  fact,  and  has  motivated  the  study  of  committee 
classifiers,  as  well  as  mixture-of-expert  approaches. 

(b) Speed: The modular classifier approach allows parallel
development of all component classifiers. Moreover, we choose to
divide the training data samples into disjoint data subsets, which
further reduces the training time required for the individual
classifiers.


The component classifier tested in this study is the Self
Organizing Map (SOM) [14] followed by the Learning Vector
Quantization (LVQ) [14] approach. Two types of combination
methods are compared: a winner-decides-all method and a distance
based weighted average method.

II. COMBINING MULTIPLE PATTERN CLASSIFIERS

In this study, multiple pattern classifiers of the same type are
developed independently on different regions of the feature space.
This is a divide-and-conquer approach to deal with the large number
of training samples. The heuristic is that, by restricting the training
samples of each classifier to a smaller region in the feature space, the
classifier can "zoom in" on that region and achieve a better
classification result when testing samples fall within that region.
This "modular learning" approach potentially offers both performance
and speed advantages, as stated above. To facilitate the division of the
feature space, the entire training data set is initially clustered
into five clusters with clustering centers {C(i); 1 ≤ i ≤ 5} using the
SOM algorithm. The number of clusters (5 in our experiment) is
selected empirically.

Then five LVQ classifiers are developed, one on each cluster. The
LVQ-PAK software (v3.0) is used in this study. Parameters are
selected according to the default setting. We denote the output of
these classifiers as y(x,i), with respect to an input feature vector x,
for 1 ≤ i ≤ 5: y(x,i) = 1 if x is deemed to be a normal ECG beat, and
y(x,i) = -1 if x is deemed to be a PVC beat. The outputs of these five
classifiers are then combined using two different combination methods:

(a)  Winner  decides  all  method 

For each testing feature vector x, compute the distance from x
to the clustering center C(i) of the training samples assigned to the
i-th classifier: $d(x,i) = \|x - C(i)\|^2$. Find i* such that

$$d(x,i^*) \le d(x,i) \quad \text{for all } i. \tag{1}$$

Then the i*-th classifier is designated as the winner, and its output is
assigned to the output of the combined classifier for x. That is,

$$y(x) = y(x, i^*) \tag{2}$$

The winner-decides-all method is a direct result of partitioning the
feature space into disjoint regions and training independent
classifiers on each region. Consequently, if x falls within a given
region, the classifier trained on that region is expected to give the
most accurate result.
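
The winner-decides-all rule can be sketched in a few lines. The sketch
assumes that each component classifier is available as a callable
returning 1 or -1 and that the cluster centers are plain lists of
coordinates; the names and the toy classifiers are hypothetical stand-ins
for the trained LVQ classifiers.

```python
def squared_distance(x, center):
    """Squared Euclidean distance between a feature vector and a center."""
    return sum((a - b) ** 2 for a, b in zip(x, center))


def winner_decides_all(x, centers, classifiers):
    """Return the output of the classifier whose center is closest to x.

    centers:     one cluster center per classifier.
    classifiers: callables mapping a feature vector to 1 or -1.
    Returns (combined output, index of the winning classifier).
    """
    distances = [squared_distance(x, c) for c in centers]
    winner = min(range(len(centers)), key=distances.__getitem__)
    return classifiers[winner](x), winner


# Toy stand-ins: two regions, each with a constant classifier.
toy_centers = [[0.0, 0.0], [10.0, 10.0]]
toy_classifiers = [lambda x: 1, lambda x: -1]
label, winner = winner_decides_all([1.0, 1.0], toy_centers, toy_classifiers)
```

Ties between equally distant centers are resolved here in favor of the
lowest index, which is an arbitrary choice of this sketch.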

(b)  Distance Based Classifier Output Weighting

In the distance based classifier output weighting method, we
compute the combined output as a linear combination of all outputs
of the five modular classifiers, and then threshold the output. That
is,

$$y(x) = T\left[\sum_{i=1}^{5} w(i)\, y(x,i)\right]; \qquad \sum_{i=1}^{5} w(i) = 1 \tag{3}$$

where T[x] = 1 if x > 0, and -1 if x < 0 is a threshold operator. In
other words, since y(x,i) = 1 or -1, eq. (3) states that the combined
output is obtained by a weighted voting of the component classifiers.
The question is how to determine the weighting w(i).

To find an exact solution to such a problem is extremely
complicated. Instead, we opt for an approximate solution as
follows: We assume that each y(x,i) is an independent estimate of
the target value t(x), with a variance $\sigma_i^2(x) = a\, d(x,i)$,
where a > 0 is a constant. Denote

$$z(x) = \sum_{i=1}^{5} w(i)\, y(x,i) \tag{4}$$

Our goal in choosing {w(i)} is to minimize the variance of z(x).
This constrained optimization problem can be solved using the Lagrange
multiplier method, which leads to the solution

$$w(i) = \frac{K}{\sigma_i^2(x)}, \qquad K = \left[\sum_{j=1}^{5} \frac{1}{\sigma_j^2(x)}\right]^{-1} \tag{5}$$

which yields a variance of z(x)

$$\mathrm{Var}\{z(x)\} = K = \left[\sum_{j=1}^{5} \frac{1}{\sigma_j^2(x)}\right]^{-1} \tag{6}$$

This solution leads to the second, distance based weighted voting
method to combine classifier outputs.
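
A minimal sketch of the distance based weighted vote follows. It uses the
fact that minimizing the variance of a weighted sum of independent
estimates, subject to the weights summing to one, gives weights
proportional to the inverse variances, here 1/d(x,i) after normalization.
The sketch assumes all distances are strictly positive and maps a
weighted sum of exactly zero to -1, a case the threshold operator leaves
open; the function names are illustrative.

```python
def inverse_distance_weights(distances):
    """Weights proportional to 1/d(x,i), normalized to sum to one."""
    inverse = [1.0 / d for d in distances]  # requires every d > 0
    total = sum(inverse)
    return [v / total for v in inverse]


def combined_variance(distances, a=1.0):
    """Variance of the weighted sum when Var[y(x,i)] = a * d(x,i)."""
    return a / sum(1.0 / d for d in distances)


def weighted_vote(outputs, distances):
    """Threshold the inverse-distance weighted sum of +1/-1 outputs."""
    weights = inverse_distance_weights(distances)
    z = sum(w * y for w, y in zip(weights, outputs))
    return 1 if z > 0 else -1  # z == 0 is mapped to -1 in this sketch


# One nearby classifier can outvote four distant ones.
vote = weighted_vote([1, -1, -1, -1, -1], [0.5, 4.0, 4.0, 4.0, 4.0])
```

In this example the single classifier at distance 0.5 receives weight 2/3,
so the combined output is 1 even though a plain majority vote would give
-1.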

III. ECG BEAT CLASSIFICATION

ECG signals are measured from electrical leads (electrodes)
attached to the human body at specific locations. Depending on the
application, there are 12-lead systems for short duration (30 seconds)
monitoring, and 3-lead systems used for the same purpose. There is
also longer term monitoring conducted using fewer electrodes. An
ECG recording consists of a sequence of spiky ECG "beats", each of
which represents one contraction of the heart. The entire episode of
one heart beat is characterized by a sequence of 5 "complexes", named
P, Q, R, S, and T. Based on the relative position (timing) of these
complexes and their shape (height and width), an ECG beat can be
categorized into one of many "labels", and a segment of ECG beats
may also demonstrate a certain rhythm. By analyzing the types of
ECG beats and the accompanying rhythms, a trained electrocardiologist
is able to diagnose probable causes of anomalies in the patient's
heart, so that appropriate treatment can be administered.

ECG  beat  classification  is  a  difficult  task  because  there  is  no 
such  thing  as  a  standard  ECG  beat  template.  Every  healthy  person 
has  a  slightly  different  shape  of  "normal"  ECG  beats.  The  rhythm 
and shape of the beat also vary with time. Sometimes it is
not only the shape of an individual beat but also its relative location
in a stream of ECG beats that determines what type of beat it
might be.

The  annotated  ECG  records  available  from  the  MIT/BIH 
(Massachusetts  Institute  of  Technology  and  Beth  Israel  Hospital) 
arrhythmia  database  [17]  have  been  used  in  this  study.  This  database 
has  48  records,  each  30  minutes  in  length.  The  data  were  recorded  in 
two channels (modified limb lead II and modified lead V1) of surface
ECGs  from  long-term  Holter  recorders.  They  represent  a  variety  of 
waveforms,  artifacts,  complex  ventricular,  junctional,  and 
supraventricular  arrhythmias,  and  conduction  abnormalities.  Data 
from  33  of  the  48  records  which  contain  normal  beats  and  PVCs 
were  used  for  this  study.  Classifiers  were  developed  and  evaluated 
using  subsets  of  data  from  channel  1  of  these  33  records  sampled  at 
360  Hz. 

Accompanying  each  record  in  the  database  is  an  annotation  file 
in  which  each  ECG  beat  has  been  identified  by  expert  cardiologists. 
These  labels  are  referred  to  as  ‘truth’  annotations  and  are  used  in 
developing  the  classifiers  and  also  to  evaluate  the  performance  of  the 
classifiers  in  the  testing  phase.  Data  are  extracted  as  one  feature 
vector for each of the beats in all the selected records. Each vector
includes one of the two possible labels according to the AAMI
recommended practice. Each feature vector has 9 elements. The
first four feature elements are temporal parameters. Temporal
parameters such as the R-R intervals are calculated as the time
difference between two consecutive QRS peaks. The temporal
features are the R-R interval between the current beat and the
previous beat (RR1), between the previous beat and the one before it
(RR0), between the current beat and the next beat (RR2), and the
ratio of RR1 and RR2. These features are extracted for each
individual beat in the database. A ratio of RR1 to RR0 provides an
indication of an abnormal timing sequence and helps in identifying
an abnormal beat. The next 5 feature elements are extracted based on
morphology. Detailed descriptions of these features can be found in
[20].
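As an illustration of the four temporal features, the following sketch derives RR0, RR1, RR2 and the RR1/RR2 ratio from a list of R-peak times. The function name and the representation of the peaks as times in seconds are assumptions made for this example; the five morphological features are not covered.

```python
def rr_features(r_peak_times, k):
    """Return (RR0, RR1, RR2, RR1/RR2) for beat k.

    r_peak_times: sorted R-peak times in seconds (hypothetical input format)
    k:            index of the current beat
    """
    if k < 2 or k + 1 >= len(r_peak_times):
        raise IndexError("beat k needs two earlier beats and one later beat")
    rr0 = r_peak_times[k - 1] - r_peak_times[k - 2]   # previous beat vs. the one before it
    rr1 = r_peak_times[k] - r_peak_times[k - 1]       # current beat vs. previous beat
    rr2 = r_peak_times[k + 1] - r_peak_times[k]       # next beat vs. current beat
    return rr0, rr1, rr2, rr1 / rr2


# A premature beat at t = 2.0 s followed by a compensatory pause.
peaks = [0.0, 0.8, 1.6, 2.0, 3.0]
print(rr_features(peaks, 3))
```

For the premature beat in this example RR1 is much shorter than both RR0 and RR2, which is the kind of abnormal timing sequence the ratios are meant to expose.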


IV.  EXPERIMENTS  AND  RESULTS 

For ECG processing, four basic statistics are calculated
according to the AAMI (Association for the Advancement of Medical
Instrumentation) recommendation [18]: true positives (TP), false
positives (FP), true negatives (TN) and false negatives (FN). This
follows a detection scenario. A true positive (TP) means that a true PVC
event has been successfully detected. False positives (FP) give the
number of false alarms, and false negatives (FN) the count of missed
PVC beats. TN can be defined similarly. Based on these
statistics, three performance criteria, sensitivity (Sens), specificity
(Spec) and positive predictivity (PP), are computed for each method.
Sensitivity (Sens = TP / (TP + FN)) is the fraction of real events that
are correctly detected; specificity (Spec = TN / (TN + FP)) is the
fraction of non-events correctly identified as non-events; and positive
predictivity (PP = TP / (TP + FP)) is the fraction of detections that are
true events. Generally, one would want all three criteria to approach
unity, but often trade-offs must be made. Among these three,
sensitivity is considered most critical, followed by specificity and
positive predictivity in decreasing order of importance. This is because
missing a life-threatening ECG beat is considered more serious than a few
false alarms, which can later be screened out manually. We also compute the
classification rate C_rate = (TP + TN) / (TP + FP + FN + TN).
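The four criteria follow directly from the counts. The sketch below (the function name is an assumption for illustration) evaluates them for the trial 1 counts reported in Table 1, with C_rate taken as the fraction of correctly classified beats.

```python
def detection_metrics(tp, fp, fn, tn):
    """Return Sens, Spec, PP and C_rate as fractions in [0, 1]."""
    return {
        "sens": tp / (tp + fn),                     # real events correctly detected
        "spec": tn / (tn + fp),                     # non-events correctly rejected
        "pp": tp / (tp + fp),                       # detections that are true events
        "c_rate": (tp + tn) / (tp + fp + fn + tn),  # beats classified correctly
    }


# Counts from trial 1 of Table 1 (single SOM+LVQ classifier).
metrics = detection_metrics(tp=1944, fp=130, fn=218, tn=22436)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

The printed percentages (89.92, 99.42, 93.73 and 98.59) match the corresponding row of Table 1.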

In order to improve the reliability of the results while coping
with the large volume of data samples, we use three-way cross
validation to create three different data sets from the same
population. We partition the original ECG data randomly into three
approximately equal-sized subsets. (Note that this is NOT the
clustering partitioning mentioned in Section II.) Then we combine
two of the three subsets to make a training data set, and use the
remaining one as the testing set. This is rotated among the three
subsets so that each subset becomes the testing set exactly once. All
the experiments are repeated three times (trials), each time
applied to one of these three different partitions.
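The rotation described above can be sketched as follows. This is only an illustration: the paper does not state how the random partition was drawn, so the shuffling, the seed and the function name are assumptions.

```python
import random


def three_way_partitions(n_samples, seed=0):
    """Return three (train_indices, test_indices) pairs, one per trial."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random partition (seed is arbitrary)
    folds = [indices[i::3] for i in range(3)]     # three roughly equal subsets

    trials = []
    for k in range(3):
        test = folds[k]                           # each subset is the test set once
        train = [idx for j, fold in enumerate(folds) if j != k for idx in fold]
        trials.append((train, test))
    return trials


for trial, (train, test) in enumerate(three_way_partitions(10), start=1):
    print(trial, len(train), len(test))
```

Each sample index appears in exactly one test set, and the training set of every trial is the union of the other two subsets.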

To provide a reference for the performance of the multiple classifier
approach, we also conducted a baseline experiment using a single
SOM+LVQ classifier to classify the entire ECG data set. The results
are summarized below:


Trials     TP    FP    FN     TN   Sens (%)  Spec (%)  PP (%)  C_rate (%)
1        1944   130   218  22436    89.92     99.42    93.73     98.59
2        1824    82   387  22434    82.50     99.64    95.70     98.10
3        1874    69   365  22419    83.70     99.69    96.45     98.24
Avg      5642   281   970  67289    85.33     99.58    95.26     98.31

Table 1. The results of the classifiers developed using
clustering algorithms SOM and LVQ and evaluated
using the nearest neighbor approach.


Next, we use the multiple classifier approach with the 'winner
decides all' combination rule. The results are summarized below:


Trials     TP    FP    FN     TN   Sens (%)  Spec (%)  PP (%)  C_rate (%)
1        1978   207   183  22358    91.53     99.08    90.53     98.42
2        2054   163   156  22353    92.94     99.28    92.65     98.71
3        2083   267   156  22220    93.03     98.81    88.64     98.29
Avg      6115   637   495  66931    92.51     99.06    90.57     98.47

Table 2. The results of the 'winner decides all' approach in
evaluating the modular networks developed by
dividing the input space into 5 regions.


Finally, we use the distance-based weighted voting combination
method. The results are summarized in Table 3.


Trials     TP    FP    FN     TN   Sens (%)  Spec (%)  PP (%)  C_rate (%)
1        2018   218   143  22347    93.38     99.03    90.25     98.54
2        2104   204   106  22312    95.20     99.09    91.16     98.75
3        2109   251   130  22236    94.19     98.88    89.36     98.46
Avg      6231   673   379  66895    94.27     99.00    90.25     98.58

Table 3. The performance of the classifiers evaluated using
the distance-based approach.

From the above tables, we observe that, compared to the single classifier
approach, both multi-classifier combination methods yield higher
sensitivity (94% and 93% vs. 85%), while sacrificing somewhat on
positive predictivity. The overall classification rate also improves
from 98.31% to 98.47% and 98.58%.

References 

[1] Bortolan, G., Degani, R., and Willems, J. L., "Design of neural
networks for classification of electrocardiographic signals,"
Proc. Annual Intl. Conf. IEEE Eng. Med. & Biol. Soc., pp.
1467-1468, 1990.

[2] Cabello, D., J. M. Salceda, S. Barro, R. Ruiz, and J. Mira,
"Statistical techniques for diagnosis of ventricular
arrhythmias," Revista de Informatica y Automatica, vol. 22,
no. 4, pp. 45-52, 1989.

[3] Drucker, H., C. Cortes, L. Jackel, Y. LeCun, and V. Vapnik,
"Boosting and other ensemble methods," Neural Computation,
vol. 6, no. 6, pp. 1289-1301, 1994.

[4] Habboush, I., G. B. Moody, and R. G. Mark, "Neural
networks for ECG compression and classification," in
Proc. Computers in Cardiology, pp. 185-188,
1991.

[5] Hansen, L. K., and P. Salamon, "Neural network ensembles,"
IEEE Trans. on PAMI, vol. 12, no. 10, pp. 993-1001, 1990.

[6] Ho, T. K., J. J. Hull, and S. N. Srihari, "Decision combination
in multiple classifier systems," IEEE Trans. on PAMI, vol. 16,
no. 1, pp. 66-76, 1994.

[7] Hu, Y. H., W. J. Tompkins, J. L. Urrusti, and V. X. Alfonso,
"Applications of artificial neural networks for ECG signal
detection and classification," Journal of Electrocardiology,
accepted, in press, 1994.

[8] Hu, Y. H., Palreddy, S., and W. J. Tompkins, "Patient
Adaptable ECG Beat Classification using Mixture of Experts,"
in Neural Networks for Signal Processing V, IEEE,
1995.

[9] Jacobs, R., "Method for combining experts' probability
assessments," Neural Computation, vol. 7, no. 5, pp. 867-888,
1995.

[10] Jacobs, R. A., M. I. Jordan, S. Nowlan, and G. E. Hinton,
"Adaptive mixtures of local experts," Neural Computation,
vol. 3, pp. 79-87, 1991.

[11] Jenkins, J. M., and Arzbaecher, R., "On-line computer pattern
recognition and classification of ventricular and
supraventricular arrhythmias," in Progress in
Electrocardiology, Macfarlane, Ed., Glasgow: Pittman
Medical, 1979.

[12] Jordan, M. I., and R. A. Jacobs, "Hierarchical mixtures of
experts and the EM algorithm," Neural Computation,
1993.

[13] Klingeman, J., and Pipberger, H. V., "Computer classification
of electrocardiograms," Comp. Biomed. Res., vol. 1, p. 1,
1967.

[14] Kohonen, T., "The Self-Organizing Map," Proc. IEEE, vol.
78, no. 9, pp. 1464-1480, 1990.

[15] Krogh, A., and J. Vedelsby, "Neural network ensembles, cross
validation and active learning," in Advances in Neural
Information Processing Systems 7, Cambridge, MA:
MIT Press, 1995.

[16] Linnenbank, A. C., Groenewegen, A. S., and Grimbergen, C.
A., "Artificial neural networks applied in multiple lead
electrocardiography: Rapid quantitative classification of
ventricular tachycardia QRS integral pattern," Proc. Annual
Intl. Conf. IEEE Eng. Med. & Biol. Soc., pp. 1461-1462,
1990.

[17] Mark, R., and G. Moody, "MIT-BIH Arrhythmia Database
Directory," MIT, 1988.

[18] Mark, R., and R. Wallen, "AAMI Recommended Practice:
Testing and Reporting Performance Results of Ventricular
Arrhythmia Detection Algorithms," Association for the
Advancement of Medical Instrumentation, AAMI ECAR-1987,
1987.

[19] Meir, R., "Bias, variance, and the combination of estimators:
the case of least linear squares," in Advances in Neural
Information Processing Systems 7, Cambridge, MA:
MIT Press, 1995.

[20] Palreddy, S., "ECG Beat Classification," Ph.D. Dissertation,
Univ. of Wisconsin-Madison, 1996.

[21] Pedrycz, W., G. Bortolan, and R. Degani, "Classification of
electrocardiographic signals: a fuzzy pattern matching
approach," Artificial Intelligence in Medicine, vol. 3, no. 4,
pp. 211-226, 1991.

[22] Pipberger, H. V., A. S. Berson, J. D. Klingeman, and C. D.
Batchlor, "Diagnostic classifications of orthogonal
electrocardiograms and vectorcardiograms," in Proc.
International Vectorcardiogram Symposium, pp. 157-163,
1971.

[23] Seung, H. S., M. Opper, and H. Sompolinsky, "Query by
committee," in Proc. 5th Workshop on Computational Learning
Theory, San Mateo, CA: Morgan Kaufmann, pp. 287-294,
1992.

[24] Tresp, V., and M. Taniguchi, "Combining estimators using
non-constant weighting functions," in Advances in Neural
Information Processing Systems 7, Cambridge, MA:
MIT Press, 1995.

[25] Tsai, Y. S., Hung, B. N., and Tung, S. F., "An experiment on
ECG classification using back-propagation neural network,"
Proc. Annual Intl. Conf. IEEE Eng. Med. & Biol. Soc., pp.
1463-1464, 1990.

[26] Turner, K., and J. Ghosh, "Error correlation and error reduction
in ensemble classifiers," Connection Science, special issue on
combining neural networks, (to appear), 1996.

[27] Wolpert, D. H., "Stacked Generalization," Neural Networks,
vol. 5, no. 2, pp. 241-259, 1992.

[28] Xu, L., and M. I. Jordan, "EM learning on a generalized finite
mixture model for combining multiple classifiers," in
Proc. World Congress on Neural Networks, Portland, OR, vol.
IV, 1993.

[29] Yeap, T. H., Johnson, F., and Rachniowski, "ECG beat
classification by a neural network," Proc. Annual Intl. Conf.
IEEE Eng. Med. & Biol. Soc., pp. 1457-1458, 1990.




Applying Neural Networks
to Adjust Insulin-Pump Doses


Fidimahery Andrianasy    Maurice Milgram

Laboratoire PARC, Tour 66, 2ème étage,
Université Paris 6, 4 Place Jussieu, 75005 Paris, France
{fidi, milgram}@robo.jussieu.fr


Abstract. Programming appropriate insulin-dose levels is a common
problem for diabetic pump users. We developed a neural-network
advisory system that suggests the appropriate next-time insulin
dose based on a short history of discontinuous blood-glucose
measurements and insulin dose settings. The diabetologists' high-level
decision-making process has been successfully learned. Our database
consists of 25,000 records from 747 insulin-pump
users under medical supervision. The concept of efficient data is
introduced. Training with efficient learning data yielded
very good generalization. A portable neural-network-controlled
insulin-pump device is being designed. A complete insulin
advisory system including our algorithm is currently under clinical
test. Preliminary results demonstrate that the performance of
the neural networks is equivalent to that of the physician.


1  Introduction 

An insulin pump is a miniature pump which delivers a continuous supply of
insulin (basal rate), with extra insulin administered for meals (prandial or
bolus rate), to an insulin-dependent diabetic (IDD). This is the best known
system that can closely mimic the normal insulin release of the body's pancreas.
Unfortunately, blood-glucose (BG) levels have to be monitored before one
can make good adjustments in insulin, food, and exercise in response to
those glucose test results. BG variations depend on several factors and vary
with time. Deciding the amount to inject is a difficult task because
morphology, future physical activity, time of meal, meal content, present glucose
concentration and the results of the previous day have to be taken into account.
Moreover, injected insulin acts with a delay and its efficiency decreases as the BG
level gets higher.

Our paper deals with the specific problem of how to predict accurately the
insulin dosage of an insulin pump. Diabetologists' knowledge is generally
in heuristic form. The medical staff at Strasbourg's Hospital applies a typical
two-phase scheme in order to control the BG level of an IDD patient under
pump treatment: (1) at the beginning of each day, the diabetologist prescribes
an injection rate profile for the next 24 hours, taking into account the present
BG concentration and the results of the previous day's injections; (2) during the
next 24 hours, the paramedical staff monitors and fine-tunes the insulin levels
so as to stay within the ideal BG target (i.e. 1 g/l). Ten measurements and tunings
per 24 hours are then performed.

Fig. 1. A pump user's glycemia responses to insulin doses (pump-setting flow rates).
Note that pump settings are constant between successive adjustments and that the
discontinuous measurements are not taken at fixed intervals. Given historical data of
a specific patient, the problem is to find appropriate doses D(t) that would induce
"normal" next-time glycemia levels G(t + 1). "Efficient facts" are selected based
on the quality index |G(t + 1) - 1|.

A preliminary inspection of the data shows a strong correlation between
the profiles of insulin advisories for two consecutive days. We propose to use
a neural model to predict the next-time insulin level based on a short history
(previous hours) and on the data from the same time of the previous day. Our aim
was to extrapolate from the physicians' experience hidden in the available clinical data.

The neural-network approach has been successfully applied to various areas
of medicine, such as diagnostic aids, biochemical analysis, image analysis,
and drug development. However, no results are known to have been reported on the
use of backpropagation networks to drive an insulin pump [1], especially for
discontinuous measurements and infusion. Lakatos et al. tried to simulate
the specialists' reasoning by first predicting the BG with a neural network;
they had to build another network using one-hour time resolution to ensure
continuous data consideration, and cubic spline interpolation to generate the BG
profile [2].

Intravenous glucose-controlled insulin infusion has until now been achieved by
using continuous BG measurements [3], [4], which does not allow
routine use in the treatment of diabetic patients. Meanwhile, numerous
studies have been conducted in closely related fields. Much of the effort
was aimed at the prediction of the BG level and the identification of the glucose
metabolism [5], [6]. Blumenfeld [7], for example, described networks that, when
presented with the serum glucose and pump settings at time steps t and t + 1,
are capable of predicting the serum glucose and suggesting the pump setting
at time t + 2. Unfortunately, many theoretical and practical obstacles remain
since ideally the network used should interpret all data as a continuous stream,
and all measurements must be taken at fixed intervals. Most of these works
rely upon classical signal processing techniques.

The problem of modelling the insulin-advisory strategy seems to be easier
to tackle than that of identifying the glucose metabolism. The insulin flow pattern
and the human blood-glucose response are highly time-dependent. In our case,
measurements are not taken at fixed intervals (called here periods) even
though they were performed at fixed times. We suggest using a multi-network
architecture, where each neural network is dedicated to one period
of time. The multi-network architecture provides an elegant way to resolve
the time-dependency problem since training a neural network under these
conditions amounts to finding a static function of its inputs.

On these assumptions, we believe that standard backpropagation
neural networks are sufficiently powerful tools for the problem of
predicting insulin dose levels, provided the neural model is complex
enough, i.e. enough neurons are used in the hidden layer of each
neural network. The "efficient data learning" concept was applied to enforce
the fact that only the best physicians' decisions, according to a given quality
index, are taken into account. One must bear in mind that the proposed
system requires a good initialization scheme at the very first time of operation,
when no previous data are available.


2  Neural-network  architecture 

The goal of this study is to build a neural-network model of the empirical
rules devised by health-care teams for choosing the right insulin dosage.
The objective is to maintain the BG level at the
constant value of 1 g/l. Our database comes from unpublished data from 747
patients under medical pump treatment. Daily records of the patients' BG
and insulin infusion levels have been collected. There are 10 records per
24 hours corresponding to 10 fixed measurement times: $[h_1, \ldots, h_{10}]$ = [1h30,
4h00, 7h30, 9h00, 12h00, 13h30, 16h00, 19h00, 20h30, 23h00]. We suppose
that 11 records are already available, that is, the patient has been monitored
for at least 24 hours. The case of the first 11 predictions will be treated in
detail in a forthcoming publication.

The meal-time hours $\{h_3, h_5, h_8\}$ are distinguished from the 7 remaining
basal hours. The entire raw data set consists of 25,000 facts. In addition to
the current and next-time BG and insulin dosage, each record includes the
time of day, the age, the weight (W), the meal contents and the Bmi factor.
The Bmi factor is defined as $Bmi = W / H^2$, where H is the patient's height.
Throughout this paper, if t is the current time ($t = h_i \in \{h_1, \ldots, h_{10}\}$), then
(t + 1) refers to the next canonical time $h_{i+1}$, with $h_{10+1} = h_1$.


Symbol       Range      Input features
G_t          [-1,+1]    The current blood-glucose level
G_{t-1}      [-1,+1]    The (t - 1) blood-glucose level
G_{t-2}      [-1,+1]    The (t - 2) blood-glucose level
S            [0,+1]     Extra sugar (during hypoglycemia)
I            [0,+1]     Extra (flash) injection (during hyperglycemia)
G_{t-10}                The previous day blood-glucose level
G_{t-9}      [-1,+1]    The resulted previous day blood-glucose level
Dn_{t-10}    [-1,+1]    The previous-day pre-normalized dose
D_{t-10}     [-1,+1]    The previous day non-normalized dose
D_m          [-1,+1]    The mean of Basal injection since t - 1
D_{t-1}      [-1,+1]    The previous time (t - 1) injection
D_{t-2}      [-1,+1]    The injection at time (t - 2)

Table 1. Input features.

The model has been decomposed into 10 backpropagation neural networks
[8], named $nn_{h_i}$. Each neural network has been trained separately
using the data associated with its canonical hour. Only cases in which insulin
prescriptions were efficient were included in each training set. Efficient
learning data are the subset of the learning facts that have induced a
normal BG level, i.e. 0.9 g/l < BG < 1.3 g/l. Only the 10th and later facts
for each patient are taken into account because the model needs the previous-day
values. The number of features (12) sets the size of the networks' input
layer (see Table 1).
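The selection of efficient learning data amounts to a simple filter on the BG level that resulted from each prescription. The sketch below is illustrative only: the record layout and the field name `next_bg` are hypothetical, and the bounds are the 0.9 g/l and 1.3 g/l limits quoted above.

```python
def select_efficient_facts(records, low=0.9, high=1.3):
    """Keep the facts whose resulting BG level (in g/l) is normal.

    records: list of dicts; "next_bg" (hypothetical field name) holds the
             BG level measured after the prescribed dose took effect
    """
    return [r for r in records if low < r["next_bg"] < high]


facts = [
    {"dose": 1.2, "next_bg": 1.05},   # normal BG: kept
    {"dose": 0.8, "next_bg": 1.80},   # hyperglycemia: dropped
    {"dose": 1.6, "next_bg": 0.70},   # hypoglycemia: dropped
    {"dose": 1.3, "next_bg": 1.25},   # normal BG: kept
]
print(len(select_efficient_facts(facts)))
```

Only the prescriptions that led to a normal BG level remain, so each network is trained on decisions that are known to have worked.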

Let $\alpha(t)$ (called the α-factor) be the linear regression coefficient between
the basal-rate vector $[D(1), \ldots, D(t)]$ and the corresponding standard
basal-rates vector $[D_{std}(1), \ldots, D_{std}(t)]$. The $D_{std}(h_i)$ ($i \in \{1, 2, 4, 6, 7, 9,
10\}$) are standard basal rates at time t, i.e. $D_{std}(t) = D_{std}(h_i = t)$. In this
paper $D_{std}(h_i)$ = [0.8, 0.8, 1.2, 1.6, 1.6, 1.3, 1.3]. These are provided by
diabetologists. The following formula was used to evaluate the α-factor:

$$\alpha(t) = \sum_{i=1,\ldots,t} D(i)\, D_{std}(i) \Big/ \sum_{i=1,\ldots,t} \left(D_{std}(i)\right)^2$$

The injection dose D(t) was normalized using the coefficient $\alpha(t)$, yielding
$Dn(t) = D(t) - \alpha(t) \cdot D_{std}(t)$. This pre-normalization was only applied to
the previous-day injections D(t - 10).
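The pre-normalization can be illustrated with a short sketch. It assumes that the α-factor is the least-squares coefficient of a regression through the origin of the doses on the standard basal rates, i.e. the sum of D(i) times the standard rate divided by the sum of squared standard rates; the function names are hypothetical.

```python
def alpha_factor(doses, std_doses):
    """Regression-through-origin coefficient of doses on standard basal rates."""
    if len(doses) != len(std_doses) or not doses:
        raise ValueError("need two non-empty vectors of equal length")
    numerator = sum(d * s for d, s in zip(doses, std_doses))
    denominator = sum(s * s for s in std_doses)
    if denominator == 0:
        raise ValueError("standard basal rates must not all be zero")
    return numerator / denominator


def pre_normalize(dose, std_dose, alpha):
    """Dn(t) = D(t) - alpha(t) * D_std(t)."""
    return dose - alpha * std_dose


# A patient whose basal doses are exactly twice the standard profile.
standard = [0.8, 0.8, 1.2]
doses = [1.6, 1.6, 2.4]
alpha = alpha_factor(doses, standard)
print(alpha, pre_normalize(doses[-1], standard[-1], alpha))
```

For a patient who follows the standard profile up to a scale factor, the α-factor recovers that scale and the pre-normalized dose is zero, so the normalized input only carries the deviation from the scaled standard profile.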

The output is the current insulin injection rate $D_o$, normalized to the
range [-1, +1]. The most successful backpropagation network architecture
had 12 inputs, 4 cells in the hidden layer and one element in the output layer
(12 : 4 : 1). The entire system is thus composed of 10 x (12 : 4 : 1) networks.
The trained networks have been tested using unseen patterns consisting of
about one-half of the entire data set.

Neural     Canonical    Mean errors
network    hour         selected data    raw data
nn_h1      01h30        0.1540           0.1594
nn_h2      04h00        0.1152           0.1219
nn_h3      07h30        0.1085           0.1046
nn_h4      09h00        0.0892           0.1066
nn_h5      12h00        0.0945           0.1273
nn_h6      13h30        0.0713           0.0886
nn_h7      16h00        0.0641           0.0841
nn_h8      19h00        0.0966           0.1158
nn_h9      20h30        0.0896           0.1005
nn_h10     23h00        0.0616           0.0674

Table 2. Each row exhibits the performance of the neural network $nn_{h_i}$, i = 1, ..., 10,
associated with canonical hour $h_i$. Relatively small errors in the third column show
that the networks generalize well on unseen data. The fourth column displays
relative errors when the system is tested against non-filtered data. Error rates are
similar to those of the previous case. Larger error rates during the early hours of
the day suggest using neural networks with a larger hidden layer for these periods
in order to handle the increased complexity of the model.


3  Results 

Results on the test set are expressed in terms of the mean relative error
$E_m = \frac{1}{N} \sum_k \left| (D_o - \hat{D}_o) / \hat{D}_o \right|$, where $D_o$ is the network's output for the
input pattern k, $\hat{D}_o$ is the desired output and N is the number of test patterns.
Training was stopped after 10000 presentations (learning rate = 0.0002) and good
performance was obtained, as observed in Table 2.
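A minimal sketch of this error measure follows, assuming the relative error of each pattern is taken with respect to the desired output and averaged over the test patterns. The function name is an assumption, and patterns whose desired output is zero are rejected because the relative error is undefined for them.

```python
def mean_relative_error(outputs, targets):
    """Average of |(output - target) / target| over all test patterns."""
    if len(outputs) != len(targets) or not targets:
        raise ValueError("need two non-empty sequences of equal length")
    if any(t == 0 for t in targets):
        raise ValueError("relative error is undefined for a zero target")
    total = sum(abs((o - t) / t) for o, t in zip(outputs, targets))
    return total / len(targets)


# Two patterns are off by 10% and one is exact.
print(round(mean_relative_error([1.1, 0.9, 2.0], [1.0, 1.0, 2.0]), 4))
```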

Each row of the table corresponds to a neural network specialized in the
prescription of the insulin dose at a specific canonical hour. The third column
shows the networks' generalization performance when only efficient facts were
selected. The errors in the fourth column were obtained using raw unfiltered
data. As could be expected, the tests conducted with the efficient facts were
better than those with the raw data. The relatively small differences between
these two tests clearly show that the trained networks have successfully
extracted the actual rules devised by the diabetologists.


4  Conclusions 

Pump therapy is for people who have insulin-dependent diabetes and who are
able to monitor their blood-glucose values and operate the pump themselves.
However, deciding the amount of insulin to inject is very difficult. We proposed a
neural-network-based system that predicts the appropriate next insulin dose
level for an insulin pump, given a short history of discontinuous measurements
of BG levels and insulin doses.

Fig. 2. Typical DebNet predictions (dashed lines) vs. actual insulin injected (solid
lines). Outputs from the second day only have been plotted.


Our approach allowed us to model the empirical rules devised by health-care
teams when computing the insulin infusion level of an insulin-pump
device. Neural networks trained with efficient data provided good accuracy
in predicting appropriate insulin levels. The behavior learned from these facts
extrapolates very well to unseen data.

Results suggest that:

• finding a viable control strategy in the particular case of blood-glucose
level control is feasible even if the available observed data are discontinuous
in nature and sampled irregularly over time; in the past, intravenous
glucose-controlled insulin infusion was only achieved by using continuous
BG measurements;

• standard backpropagation neural networks are powerful enough to model
the heuristic rules devised by an experienced health-care team when deciding
the amount of insulin to apply in each specific circumstance;

• the normal body insulin release seems to be a strongly time-dependent
process, with a varying time constant depending on the period of the
day; this time constant roughly ranges from 10 to 15 minutes around
meal time to more than 3 hours during the sleeping period of the night;

• the multi-network approach appears to be the most natural way to
handle the variable time-constant problem since the number of different
processes to be modelled remains low;

• training backpropagation neural networks with efficient data seems to
enhance generalization performance.

A software prototype called DebNet was put under clinical test.
Preliminary reports are very encouraging [9] since the health-care team almost
never had to intervene during its operation (see Figure 2). In this version
DebNet used a simple algorithm to initialize the system. Due to lack of
space, the details of the implemented initialization scheme will be described
in a forthcoming publication. A portable pump controlled by DebNet is also
being designed. Such a device (insulin pump + DebNet) would help hospitals
cut costs by providing faster and accurate prescriptions with fewer costly
specialists.


Abstract

In this paper we propose a multimodal perceptron tree
(MMPT) neural network to segment magnetic resonance (MR) images.
The architecture consists of simple networks (neurons)
hierarchically connected in a tree structure. The latter is built up
during training by the adopted depth-first searching technique,
augmented with choosing the best hyperplane split of the feature
subspace at each tree node. This neural network effectively partitions
the feature space into subregions, and each terminal subregion is
assigned a class label depending on the data routed to it. As the
tree grows, the number of training data for each node decreases,
which results in fewer weight-update epochs and decreases the time
consumption.

The MMPT performance is compared to that of a multilayered
perceptron (MLP). The networks are applied to brain MR image
segmentation into gray matter/white matter regions.


1.  INTRODUCTION 

If the analysis of brain MR images could be automated, it would provide
accurate and reproducible results as well as relieve human experts of
time-consuming tasks [1]. MRI has been exploited for noninvasive diagnosis as
well as for tissue identification for surgical planning and for interpreting
other images such as positron emission tomography. In order to apply MR, we
often have to introduce a reasonable segmentation technique. Neural networks
may provide better solutions for the pattern classification of medical images
than the conventional methods. Neural-network-based segmentation of MR images
according to tissue characteristics is important for the diagnosis of
neoplasms, 3D display of human organs during surgery simulation, and MR-based
deconvolution of PET images [2, 3].

In this paper we combine the concept of decision tree classifiers presented
in [4, 5] with the popular network structure of the MLP. The basic idea behind
tree classifiers is to organize the class assignment into a tree search, which
is guided by properly chosen questions based on the feature vector. In
conventional systems the node questions are represented as internal nodes of
the tree and the class labels as leaves. When the feature vector has many
attributes, a decision tree checks one attribute at a time, which restricts
the orientation of the splitting hyperplanes. This can also cause the growth
of unreasonably large trees [6]. The proposed neurons consider a combination
of feature attributes, thus greatly simplifying the tree structure.

The MLP has been successfully applied in many image processing tasks.
However, it still has drawbacks which make the construction of an MLP more of
an art than a science. The MMPT attempts to eliminate some of the MLP's
shortcomings, e.g. the need to specify the architecture beforehand and the
need to choose the activation function so as to speed up convergence and
avoid the notorious local minima [7, 8].

A related method exists in the field of speech processing [9, 10]. That
method is discussed in Section 4 of this paper in comparison to the proposed
method.


2.  MMPT  LEARNING  ALGORITHM 

The MMPT is grown through training and therefore the number of units need
not be specified beforehand. At every tree node, a combination of feature
attributes is used to form a splitting hyperplane.


Fig. 1 Multimodal Perceptron Tree neural network architecture: a) architecture
of a neuron as a part of the MMPT network; b) example of a fully grown 3-way
MMPT


The tree architecture is shown in Fig. 1. A neuron consists of a simple
network with no hidden units, as illustrated in Fig. 1a. A mature tree for
3-way classification is shown in Fig. 1b. The squares are leaves, i.e. classes
to which the data belong after the training is completed.

The growing procedure starts with training the root node (node #1 in Fig. 1b)
over all training data. The classes are labeled using binary vectors and the
data are classified according to a winner-take-all rule:

$$\text{sample} \in \text{class } i \quad \text{if} \quad f(\mathbf{w}_i \cdot \mathbf{x}) > f(\mathbf{w}_j \cdot \mathbf{x}), \quad \forall j \neq i \qquad (1)$$

where $\mathbf{w}_i$ is the weight vector of output $i$, $\mathbf{x}$
represents the input vector and $f$ is the activation function. After training
the nodal neural network, the winner-take-all rule divides the feature space
into three convex regions. The training data in each region are assigned to a
different child node. If the data at a child node represent only one class,
the node turns into a leaf signifying this particular class. If the region
still contains misclassified data, the training/splitting procedure is
repeated at that child node, thus decreasing the number of training exemplars
for the next layer of child nodes. In this way the training data set is
divided and subdivided recursively until the leaf nodes representing the
actual classes are reached. Therefore the number of training examples
decreases with increasing tree depth. This leads to reduced time consumption
compared to the MLP, which suffers from the moving target problem: all the
weights change at once and each hidden unit sees a continuously changing
environment. Instead of moving quickly to assume useful roles in the problem
solution, the hidden units engage in many wasted motions.
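
The routing step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `sigmoid`, `winner` and `route`, the convention of one weight vector per class output, and the toy weights and samples are assumptions introduced only for illustration.

```python
import math

def sigmoid(a):
    """Logistic activation with output range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def winner(weights, x):
    """Index i of the output whose activation f(w_i . x) is largest."""
    acts = [sigmoid(sum(wk * xk for wk, xk in zip(w, x))) for w in weights]
    return max(range(len(acts)), key=acts.__getitem__)

def route(weights, samples):
    """Split samples into one subset per output (one subset per child node)."""
    children = [[] for _ in weights]
    for x in samples:
        children[winner(weights, x)].append(x)
    return children

# Hypothetical 3-way node on 2-D inputs (last component is a constant bias input).
node_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-1.0, -1.0, 0.0]]
samples = [[2.0, 0.5, 1.0], [0.1, 3.0, 1.0], [-2.0, -1.0, 1.0]]
children = route(node_weights, samples)
```

A child whose subset contains a single class would become a leaf; otherwise that subset would be used to train the child node, as described in the text.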

The neurons apply a sigmoid activation function,

$$f(x) = \frac{1}{1 + e^{-x}} \qquad (2)$$

with the output range (0, 1), and are trained independently of each other by a
gradient descent method with respect to the least mean square error (L2 norm)

$$E_i = (t_i - y_i)^2, \qquad (3)$$

where $t_i$ is the required output for pattern $i$ and $y_i$ is the actual
neuron output. The weights are adjusted according to:

$$w_j^{n+1} = w_j^n + \eta\, y_i (1 - y_i)(t_i - y_i)\, x_j + \alpha\,(w_j^n - w_j^{n-1}) \qquad (4)$$

$$0 < \eta < 1 \quad \text{and} \quad 0 < \alpha < 1,$$

where $w_j^n$ is the weight from the $j$-th input $x_j$ to the neuron at
iteration $n$, $\eta$ is the training coefficient and $\alpha$ is the
so-called momentum coefficient. The weight update procedure is stopped when
the average error does not decrease by more than a small threshold over some
time span.

The momentum term $\alpha\,(w_j^n - w_j^{n-1})$ is added to speed up the
convergence. The training procedure stops if no further splitting is
necessary. During testing the data are fed into the root neuron and directed
according to (1) to the respective child neurons. The ideal responses of the
perceptrons are 0 and 1; in practice, however, we set the response thresholds
to 0.1 and 0.9.
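
As a hedged illustration of the weight update with momentum, the sketch below trains a single sigmoid neuron on a toy problem. The function name `train_neuron`, the fixed number of epochs (used here in place of the error-plateau stopping rule), the parameter values and the logical-AND data are all assumptions, not details taken from the paper.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_neuron(patterns, targets, eta=0.5, alpha=0.5, epochs=2000):
    """Train one sigmoid neuron by the delta rule with a momentum term.

    patterns: list of input vectors (append a constant 1.0 for a bias).
    targets:  list of desired outputs, 0 or 1.
    """
    w = [0.0] * len(patterns[0])
    prev_step = [0.0] * len(w)            # previous weight change, w^n - w^(n-1)
    for _ in range(epochs):
        for x, t in zip(patterns, targets):
            y = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
            scale = y * (1.0 - y) * (t - y)
            # gradient step plus momentum (alpha times the previous change)
            step = [eta * scale * xj + alpha * pj
                    for xj, pj in zip(x, prev_step)]
            w = [wj + sj for wj, sj in zip(w, step)]
            prev_step = step
    return w

# Toy example (assumed data): logical AND with a constant bias input.
X = [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
T = [0.0, 0.0, 0.0, 1.0]
w = train_neuron(X, T)
outputs = [sigmoid(sum(wj * xj for wj, xj in zip(w, x))) for x in X]
```

In this sketch the weights move in the direction that reduces the squared error between target and output, and the momentum term reuses a fraction of the previous change.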

This is a description of a general tree growing algorithm. The
characteristics which set the proposed algorithm apart are the depth-first
searching technique and the estimation of the best-fit neuron to split the
feature subspace at each particular node. The chosen searching technique is
guaranteed to find deep solutions, as it searches every branch down to the
final classifying split in vertical fashion. This is important for binary or
small-class (three- to four-way branching) trees and for highly non-linearly
separable problems such as the intertwined spirals [11]. However, regardless
of the type of searching technique, a randomly chosen perceptron may offer a
poor division of the feature space, thus providing its child nodes with input
data that are inseparable or that prolong and complicate the search for a
solution [12]. This is why we introduce a measure for estimating the most
suitable neuron for every decision-making node in the tree.

The examination procedure includes training the node until its output is
obtained. The input data set includes one or more input values per sample
(Fig. 2a), which are the outputs of the parent nodes for this particular
sample. This is valid only for child nodes, as the root node is taken at
random. We then form the function

$$S = \sum_{o=0}^{m-1} \left| \sum_{s=0}^{n-1} (y_s - \bar{y})(E_{o,s} - \bar{E}_o) \right| \qquad (5)$$

where $y_s$ is the output for the particular input $s$, $\bar{y}$ is its
averaged value over all samples, $E_{o,s}$ is the error for sample $s$ at
output $o$, $\bar{E}_o$ is its averaged value over all samples, $m$ is the
number of outputs, and $n$ is the total number of training samples. The
purpose is to obtain a node with maximum $S$, which means maximum magnitude of
correlation between a unit's value and the residual error $E_o$ estimated at
output $o$. Instead of creating and examining node by node, we set up a pool
of candidates with different initial weights and examine the whole pool for
the best-fit candidate according to the described criterion. When such a
neuron is found, it is adopted into the network architecture. As the tree
deepens, each node takes one more input, i.e. the dimension of the input space
increases, which makes the task of classification easier [13]. A flowchart of
the algorithm is shown in Fig. 2b. Table 1 shows the results for the
intertwined spirals task for the proposed method and the one presented in
[9, 10]. No pruning was applied in either case. This result indicates that the
proposed method demonstrates better classification abilities, and therefore
that the introduced measure and the applied searching technique are a better
fit for the purposes of image classification.

Fig. 2 The tree growing algorithm: a) scheme for applying the measure for
choosing the best-fit neuron; b) flowchart of the method
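
The candidate-pool selection can be sketched in Python as follows. This is only an illustration under stated assumptions: the measure is taken to be the sum over outputs of the magnitude of the correlation-like sum between a candidate's output and the residual error, as described for (5), and the names `correlation_score` and `best_candidate` and the toy data are hypothetical.

```python
def correlation_score(outputs, errors):
    """Sum over network outputs of the magnitude of the correlation-like sum
    between a candidate's output and the residual error.

    outputs: candidate output y_s for each of the n samples.
    errors:  errors[o][s], residual error at output o for sample s.
    """
    n = len(outputs)
    y_mean = sum(outputs) / n
    score = 0.0
    for e_o in errors:
        e_mean = sum(e_o) / n
        score += abs(sum((y - y_mean) * (e - e_mean)
                         for y, e in zip(outputs, e_o)))
    return score

def best_candidate(pool_outputs, errors):
    """Index of the candidate in the pool with the largest score."""
    scores = [correlation_score(y, errors) for y in pool_outputs]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical pool of two candidates evaluated on four samples, one output.
errors = [[0.0, 1.0, 0.0, 1.0]]
pool = [[0.0, 0.0, 1.0, 1.0],   # uncorrelated with the residual error
        [0.0, 1.0, 0.0, 1.0]]   # follows the residual error exactly
```

Here the second candidate would be adopted, since its output varies together with the residual error while the first candidate's output does not.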


TABLE 1 PERFORMANCE COMPARISON BETWEEN MMPT AND
STATE-OF-ART METHOD ON INTERTWINED SPIRALS TASK

                          MMPT    Zrida, Mammone [9, 10]
correct classification    88%     63%

3.  SEGMENTATION  RESULTS 

The brain MR image to be segmented is shown in Fig. 3a. It is a 160 x 160
pixel gray-scale image with 256 gray levels. The information contained in this
image can be divided basically into three classes: the cerebro-spinal fluid
(CSF), muscles, and gray/white matter. The aim is to recognize the white
matter and gray matter regions, so that the result can be used as a priori
knowledge in the processing of positron emission tomography (PET) images [3].


Fig. 3 MR image to be segmented: a) MRI gray scale, 160 x 160 pixels; b) image
partitioned into 20 x 20 blocks


The procedure for classification is divided into two steps. In Step 1, we
single out the muscles and CSF portion as unimportant information. To proceed
with Step 1 we partition the image as shown in Fig. 3b, which leaves us with
400 blocks to classify as non-brain tissue, border blocks (i.e. blocks
containing brain tissue as well as non-brain tissue), and gray/white blocks.
We divide the brain tissue in this way since the data require different input
feature vectors to be classified correctly by the neural network.

Step 2 is responsible for the pixel-by-pixel classification into the white
matter/gray matter regions. This step is also performed with the MMPT network.

The drawback of having two steps is the requirement for more than 97% correct
classification, at least in Step 1. On the other hand, such a division of the
process allows precise classification of the meaningful information, the brain
tissue.

The described process is summarized in Fig. 4, where the ellipses signify the
data subject to further processing, and the rectangles contain the final
result to be achieved.

An MMPT network with 5 inputs and 3 outputs was implemented in Step 1.

The input feature vectors were determined as follows:

- averaged pixel value over a block { $avrg = \sum_b b\, p(b)$, where $p(b)$
is the probability density of pixel value $b$ };

- variance of a block { $vrnc = \sum_b (b - avrg)^2\, p(b)$ };

- skewness of a block { $skew = [\sum_b (b - avrg)^3\, p(b)] / vrnc^{3/2}$ };

- Laplacian operator of a block { calculated in the 8-neighbourhood according
to a 3x3 mask };

- maximum pixel value for the block.

The  results  for  MMPT  and  MLP  are  given  in  Table  2  and  the  resultant  image  is 
shown  in  Fig.5a. 
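
The statistical block features listed above can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the function name `block_features` is hypothetical, the probability density p(b) is assumed to be the normalized histogram of the block, the skewness of a constant block is defined as 0 by assumption, and the Laplacian feature is omitted because its aggregation over a block is not specified in the text.

```python
from collections import Counter

def block_features(block):
    """Mean, variance, skewness and maximum of the pixel values in a block.

    block: flat list of integer gray levels.
    """
    n = len(block)
    p = {b: c / n for b, c in Counter(block).items()}    # empirical p(b)
    avrg = sum(b * pb for b, pb in p.items())
    vrnc = sum((b - avrg) ** 2 * pb for b, pb in p.items())
    third = sum((b - avrg) ** 3 * pb for b, pb in p.items())
    skew = third / vrnc ** 1.5 if vrnc > 0 else 0.0      # constant block: skew set to 0
    return avrg, vrnc, skew, max(block)
```

For example, `block_features([0, 0, 0, 4])` gives a mean of 1 and a variance of 3, with a positive skewness because the single bright pixel lies far above the mean.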


Fig.4  Structure  of  the  segmentation  process 


Fig. 5 Results of the two-step segmentation procedure: a) Step 1 -
classification of the groups; b) Step 2 - classification according to the
pixel attributes


In Step 2 we implement two MMPTs: one with 4 inputs and 3 outputs for
processing the border blocks, and one with 3 inputs and 2 outputs for the
gray/white blocks. The inputs for the border blocks are:

- pixel value;

- difference from the highest pixel in the block;

- difference from the highest neighbor within distance 1;

- difference from the next-to-highest pixel in the block.

The three outputs correspond to the unimportant information, gray matter
pixels and white matter pixels. For the gray/white blocks the input features
are:

- pixel value;

- difference from the highest pixel in the block;

- difference from the highest neighbor within distance 1;

and the 2 outputs correspond to the gray matter and white matter classes.


TABLE 2 PERFORMANCE COMPARISON BETWEEN MMPT AND MLP

                              % of correct           computational complexity
                              classification         (weight updates)
Classification                MMPT      MLP          MMPT         MLP
Step 1                        98.9%     98.5%        511 455      1 512 235
Step 2 - border blocks        90.5%     88%          812 500      2 050 345
Step 2 - gray/white blocks    89%       85.5%        857 140      1 523 010

The results are given in Table 2 and the final segmented image is shown in
Fig. 5b. The same steps were applied to the MLP, and the results of MLP
segmentation are also presented in Table 2. The classification performance of
the MLP is slightly worse than that of the MMPT. The practical advantage of
the MMPT network over the MLP, however, is its decreased computational cost,
which we measure as the number of weight updates required for training.
According to Table 2, the MLP requires roughly 1.8 to 3.0 times as many weight
updates as the MMPT.
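
The cost comparison can be checked with a few lines of Python. The figures are copied from Table 2; the variable names are chosen here only for illustration.

```python
# Weight updates required for training, copied from Table 2: (MMPT, MLP).
weight_updates = {
    "Step 1": (511455, 1512235),
    "Step 2 - border blocks": (812500, 2050345),
    "Step 2 - gray/white blocks": (857140, 1523010),
}

# Ratio of MLP to MMPT weight updates for each classification task.
ratios = {task: mlp / mmpt for task, (mmpt, mlp) in weight_updates.items()}
for task, r in ratios.items():
    print(f"{task}: MLP needs {r:.2f} times as many weight updates as MMPT")
```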


4.  DISCUSSION 

Although based on decision trees, the MMPT does not inherit the major
shortcomings of that approach. The proposed architecture offers an elegant way
of splitting the feature space by hyperplanes not necessarily orthogonal to
the axes. A basic feature of decision trees is that they work with a symbolic
representation of the input information, which is not suitable for many
applications. By applying neural networks grown as a decision tree, we allow
the decision tree to deal with numerically expressed information and we
incorporate the ability of the decision tree to interpret if-then rules into
the network. The latter is trained with the simple delta rule, rather than by
exhaustive search throughout the whole feature set based on calculation of an
information criterion.

The method presented in [9, 10] is based on a breadth-first searching
technique, which builds the tree in horizontal fashion, i.e. level after
level. One shortcoming is that it is not guaranteed to find solutions deep in
the tree hierarchy, and it is therefore unsuitable for binary or small-class
trees. The technique also does not perform any estimation of neuron fitness
prior to adopting a node into the tree hierarchy. The authors rely on a
post-building hyperplane perturbation strategy and tree pruning. The proposed
method, MMPT, incorporates depth-first search, i.e. every branch is searched
down to the final classifying split in vertical fashion. With this approach we
also propose an estimation measure for the best neuron to split the feature
subspace. The depth-first search, combined with the introduced measure, yields
high correct classification rates even for difficult tasks such as the
intertwined spirals problem.

The proposed method was initially developed for binary trees [14]. In that
case the procedure takes four processing steps. The MMPT adopts the natural
hierarchy of the process (Fig. 4), as it allows for three-class segmentation.
This reduces the total error accumulated over the hierarchical processing
steps. Here we also introduce a more effective measure for estimation of the
best-fit neuron.

A question that might arise concerns the way to choose the proper input
feature vectors for the training. For the discussed examples they were chosen
as the best combinations out of a set of combinations; combinations of 9
candidates per classification were checked. This was done by trial and error.
However, if this procedure is done in the light of a class-entropy criterion,
the process will be more accurate, systematic and speedy [15]. Another option
for improving the algorithm is to include not only sigmoid but also Gaussian
units. From experiments with artificial patterns we have concluded that a
Gaussian activation function could decrease the computational cost and improve
the performance.


5.  CONCLUSION 

In this work we apply the MMPT neural network to image segmentation problems.
The proposed network training algorithm combines depth-first search with the
introduced best-fit measure, which yields high correct classification rates
even when the problem is highly non-linearly separable. Although it is based
on decision tree growing, the MMPT does not resort to the exhaustive search
techniques used in decision trees and offers an elegant way of splitting the
input feature space due to the freedom in choosing the hyperplane orientation.

The obtained classification rate is comparable to the performance of the MLP,
but the proposed MMPT has some additional advantages:

- building through training;

- lower computational and structural complexity;

- easier hardware implementation due to a smaller number of connections.

However, the following points remain to be addressed:

- implementing different activation functions in order to reduce the
computational complexity and cost;

- implementing a more reliable method for choosing the proper input feature
vectors.

Abstract


Clustered microcalcifications on X-ray mammograms are an important sign in the
detection of breast cancer. This paper quantitatively describes the usefulness
of texture analysis methods for the detection of clustered microcalcifications
on digitized mammograms. Comparative studies are performed for the proposed
texture analysis method, called the surrounding region dependence method
(SRDM), and conventional texture analysis methods such as the spatial
gray-level dependence method (SGLDM), the gray-level run length method
(GLRLM), and the gray-level difference method (GLDM). These methods are
applied to classify regions of interest (ROIs) into positive ROIs containing
clustered microcalcifications and negative ROIs of normal tissues. The
database is composed of 72 positive and 100 negative ROI images, which are
selected from digitized mammograms with a pixel size of 100 x 100 µm² and 12
bits per pixel. An ROI is selected as an area of 128 x 128 pixels on the
digitized mammograms. A three-layer backpropagation neural network is employed
as a classifier. The results of the neural network for the texture analysis
methods are evaluated by receiver operating characteristic (ROC) analysis.
From the viewpoint of classification accuracy and computational complexity,
the SRDM is superior to the conventional methods.

1. INTRODUCTION


The early detection of breast cancer is the most important factor for
reducing breast cancer mortality. It has been empirically recognized that the
presence of clustered microcalcifications on X-ray mammograms is an important
sign in the detection of breast cancer, where individual microcalcifications
are up to about 0.7 mm in diameter, with an average diameter of 0.3 mm [1].
Computer-aided diagnosis (CAD) has been of interest to many researchers for
the detection of clustered microcalcifications on mammograms [2]-[4].

In  this  paper,  the  usefulness  of  texture  analysis  methods  is  quantitatively 
analyzed  for  the  detection  of  clustered  microcalcifications.  The  texture 
analysis  methods  are  the  surrounding  region  dependence  method  (SRDM) 
proposed by the authors [5], the spatial gray-level dependence method (SGLDM)
[6],  the  gray-level  run  length  method  (GLRLM)  [7],  and  the  gray-level 
difference  method  (GLDM)  [8].  The  three-layer  backpropagation  neural 
network  is  employed  as  a  classifier.  The  results  of  the  neural  network  for  these 
texture  analysis  methods  are  evaluated  by  the  receiver-operating 
characteristics  (ROC)  analysis  [9].  The  area  under  the  ROC  curve,  Az,  is  used 
as  a  measure  of  the  classification  performance. 


2.  THE  SURROUNDING  REGION  DEPENDENCE  METHOD  (SRDM) 

The SRDM is based on the second-order histogram in two surrounding regions.
Let us consider three rectangular windows centered on a current pixel
$(x, y)$, as shown in Fig. 1. In Fig. 1, $R_1$ and $R_2$ are the inner
surrounding region and the outer surrounding region, respectively, and $w_1$,
$w_2$, and $w_3$ denote the size of each square region. In this study, $w_1$,
$w_2$, and $w_3$ have the values of 3, 5, and 7, respectively. A region of
interest (ROI) image is transformed into a surrounding region dependence
matrix, which is defined as

$$M(q) = [\alpha(i, j)], \quad 0 \le i \le m, \; 0 \le j \le n \qquad (1)$$

where $q$ is a given threshold value, and the values of $m$ and $n$ are the
total numbers of pixels of regions $R_1$ and $R_2$, respectively. In eq. (1),
the element $\alpha(i, j)$ is given as

$$\alpha(i, j) = \#\{(x, y) \mid c_{R_1}(x, y) = i \text{ and } c_{R_2}(x, y) = j, \; (x, y) \in L_x \times L_y\} \qquad (2)$$

where # denotes the number of elements in the set, and $L_x \times L_y$ is the
2-D image space. In eq. (2), the inner count $c_{R_1}(x, y)$ and the outer
count $c_{R_2}(x, y)$ on the current pixel $(x, y)$ are defined as follows:

$$c_{R_1}(x, y) = \#\{(k, l) \mid (k, l) \in R_1 \text{ and } [S(x, y) - S(k, l)] > q\} \qquad (3)$$

$$c_{R_2}(x, y) = \#\{(k, l) \mid (k, l) \in R_2 \text{ and } [S(x, y) - S(k, l)] > q\} \qquad (4)$$

where $S(x, y)$ is the image intensity on the current pixel $(x, y)$. In
general, the larger the threshold value $q$, the more microcalcifications can
be missed, whereas the smaller the value of $q$, the stronger the effect of
random noise, so that negative ROIs can be classified as positive. The optimal
selection of the $q$ value is therefore very important for the classification
performance.
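
A minimal Python sketch of the surrounding region dependence matrix is given below. It is an illustration under stated assumptions rather than the authors' implementation: the names `region_offsets` and `srdm` are hypothetical, each surrounding region is taken to be the square ring between two consecutive windows, and pixels whose largest window does not fit inside the image are skipped.

```python
def region_offsets(inner, outer):
    """Offsets of the square ring between an inner x inner window and an
    outer x outer window, both centered on the current pixel."""
    hi, ho = inner // 2, outer // 2
    return [(dk, dl)
            for dk in range(-ho, ho + 1)
            for dl in range(-ho, ho + 1)
            if max(abs(dk), abs(dl)) > hi]

def srdm(image, q, w1=3, w2=5, w3=7):
    """Surrounding region dependence matrix M(q) as a list of lists.

    image: 2-D list of intensities; q: threshold.
    Entry M[i][j] counts pixels with inner count i and outer count j.
    """
    r1 = region_offsets(w1, w2)          # inner surrounding region R1
    r2 = region_offsets(w2, w3)          # outer surrounding region R2
    M = [[0] * (len(r2) + 1) for _ in range(len(r1) + 1)]
    rows, cols = len(image), len(image[0])
    half = w3 // 2
    for x in range(half, rows - half):   # border pixels skipped (assumption)
        for y in range(half, cols - half):
            s = image[x][y]
            c1 = sum(1 for dk, dl in r1 if s - image[x + dk][y + dl] > q)
            c2 = sum(1 for dk, dl in r2 if s - image[x + dk][y + dl] > q)
            M[c1][c2] += 1
    return M
```

With window sizes 3, 5 and 7 the two rings contain 16 and 24 pixels, so a bright isolated spot contributes to the lower-right part of the matrix while flat background contributes to the element at (0, 0).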

It is evident that the surrounding region dependence matrix $M(q)$ contains
the textural information of an image. The texture coarseness or fineness of an
image can be interpreted as the distribution of the elements in the matrix
$M(q)$. In particular, the distribution of elements tends to spread near the
right and/or the lower-right corner of the matrix for positive ROIs containing
clustered microcalcifications. From the spread characteristics of the elements
in the surrounding region dependence matrix, we defined four textural
features: the horizontal weighted sum (HWS), the vertical weighted sum (VWS),
the diagonal weighted sum (DWS), and the grid weighted sum (GWS) [5].


Fig.  1.  Configuration  of  the  surrounding  regions  on  the  current  pixel  (x,y). 


3.  EXPERIMENTAL  RESULTS 
3.1.  ROI  Selection 

For the comparison study of the texture analysis methods, 172 ROIs, each of
128 x 128 pixels, were selected from our database of digitized mammograms,
which were digitized with a Lumisys laser film scanner with a pixel size of
100 x 100 µm² and 12 bits per pixel. Among the selected 172 ROIs, 72 ROIs were
positive, containing clustered microcalcifications, and 100 ROIs were
negative, containing only normal tissues. Positive ROIs included clustered
microcalcifications in dense regions and/or in glandular tissues. All of the
clustered microcalcifications in positive ROIs were verified by an expert
mammographer based on visual criteria and biopsy results. Clustered
microcalcifications are defined as three or more microcalcifications within an
ROI (i.e., 1.28 x 1.28 cm²). Negative ROIs include various breast areas
involving ducts, breast boundaries, Cooper's ligaments, blood vessels, film
artifacts, a single large calcification with benign characteristics, and/or
glandular tissues.

3.2.  Conventional  Texture  Analysis  Methods 

Three conventional texture analysis methods were evaluated with the same
database, and all the textural features were obtained from 12-bit gray-level
images.


3.2.1.  Spatial  Gray-Level  Dependence  Method  (SGLDM) 

The SGLDM [6] is based on the probability, $p(i, j \mid d, \theta)$, that two
pixels separated by intersample spacing distance $d$ at angle $\theta$ have
gray level $i$ and gray level $j$. Thirteen textural features [6][10] are
measured from the probability matrix: energy, entropy, correlation, local
homogeneity, inertia, sum average, sum variance, sum entropy, difference
average, difference variance, difference entropy, and information measures of
correlation 1 and 2. In this study, we computed four spatial gray-level
dependence matrices according to four different directions ($\theta$ = 0°,
45°, 90°, and 135°) with a given distance $d$, and calculated textural
features for each matrix.
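
As a hedged sketch of how such a matrix might be estimated, the Python code below counts gray-level pairs for one distance and angle and derives two of the listed features. The names are hypothetical, and the row/column convention for the angles, the symmetric counting of each pair, and the sparse dictionary representation (convenient for 12-bit images) are assumptions made for illustration.

```python
import math
from collections import Counter

# (row, column) unit displacement for each angle; rows grow downwards (assumed).
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}

def cooccurrence(image, d, theta):
    """Sparse estimate of p(i, j | d, theta) as {(i, j): probability}."""
    dr, dc = OFFSETS[theta][0] * d, OFFSETS[theta][1] * d
    rows, cols = len(image), len(image[0])
    counts = Counter()
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[(i, j)] += 1
                counts[(j, i)] += 1      # count each pair in both orders
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

def energy(p):
    """Sum of squared probabilities."""
    return sum(v * v for v in p.values())

def entropy(p):
    """Shannon entropy (natural logarithm) of the pair distribution."""
    return -sum(v * math.log(v) for v in p.values())
```

Storing only the gray-level pairs that actually occur avoids allocating a full 4096 x 4096 matrix for 12-bit data.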

3.2.2.  Gray-Level  Run  Length  Method  (GLRLM) 

The GLRLM [7] is based on the number of times, $g(i, j \mid \theta)$, that the
picture contains a run of length $j$ of gray level $i$ in the given direction
$\theta$. Four gray-level run length matrices are computed according to four
different directions ($\theta$ = 0°, 45°, 90°, and 135°). Five textural
features [7] are measured from each matrix: short runs emphasis, long runs
emphasis, gray level nonuniformity, run length nonuniformity, and run
percentage.
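
The run-length counting can be sketched in Python as follows. The function names and the representation of a direction as a (row, column) step are assumptions for illustration; the two features shown follow their commonly used definitions (runs weighted by the inverse squared length, and number of runs divided by number of pixels), which may differ in detail from the implementation used in the experiments.

```python
from collections import Counter

def run_length_matrix(image, step=(0, 1)):
    """Sparse g(i, j): number of runs of gray level i with length j along `step`."""
    dr, dc = step
    rows, cols = len(image), len(image[0])

    def inside(r, c):
        return 0 <= r < rows and 0 <= c < cols

    g = Counter()
    for r in range(rows):
        for c in range(cols):
            level = image[r][c]
            pr, pc = r - dr, c - dc
            if inside(pr, pc) and image[pr][pc] == level:
                continue                  # not the first pixel of its run
            length = 1
            nr, nc = r + dr, c + dc
            while inside(nr, nc) and image[nr][nc] == level:
                length += 1
                nr, nc = nr + dr, nc + dc
            g[(level, length)] += 1
    return g

def short_runs_emphasis(g):
    """Runs weighted by 1 / length^2, normalized by the number of runs."""
    return sum(n / (j * j) for (_, j), n in g.items()) / sum(g.values())

def run_percentage(g, n_pixels):
    """Number of runs divided by the number of pixels."""
    return sum(g.values()) / n_pixels
```

A step of (0, 1) scans horizontally; steps such as (-1, 1), (-1, 0) and (-1, -1) would cover the remaining three directions under the assumed convention.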

3.2.3. Gray-Level Difference Method (GLDM)

The GLDM [8] is based on the probability of occurrence that two pixels
separated by a specific displacement vector $\delta$ have a given difference.
In this analysis, four kinds of displacement vectors are considered: $(0, d)$,
$(-d, d)$, $(d, 0)$, and $(-d, -d)$, where $d$ is the intersample spacing
distance. The five textural features [8] used in the experiments are contrast,
angular second moment, entropy, mean, and inverse difference moment. In this
study, we computed the probability density functions according to the four
kinds of displacement vectors and calculated textural features for each
probability density function.
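
A short Python sketch of the difference statistics is given below, as an illustration only. The function names are hypothetical, the displacement is interpreted as a (row, column) offset, absolute differences are used, and the five features follow their commonly used definitions; all of these are assumptions rather than details confirmed by the text.

```python
import math
from collections import Counter

def difference_pdf(image, delta):
    """Estimate the probability of each absolute gray-level difference for
    pixel pairs separated by the displacement delta = (d_row, d_col)."""
    dr, dc = delta
    rows, cols = len(image), len(image[0])
    counts = Counter()
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[abs(image[r][c] - image[r2][c2])] += 1
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items()}

def gldm_features(p):
    """Five features of a difference distribution p = {difference: probability}."""
    return {
        "contrast": sum(k * k * v for k, v in p.items()),
        "angular_second_moment": sum(v * v for v in p.values()),
        "entropy": -sum(v * math.log(v) for v in p.values()),
        "mean": sum(k * v for k, v in p.items()),
        "inverse_difference_moment": sum(v / (k * k + 1) for k, v in p.items()),
    }
```

Calling `gldm_features(difference_pdf(image, (0, d)))` for each of the four displacement vectors would give one set of five features per direction.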
3.3. Classifier


The classification algorithm used in this paper is a three-layer
backpropagation neural network [11]. A nonlinear sigmoid function with "0" and
"1" saturation values is used as the activation function for each neuron. In
the training process, the weights between the neurons are adjusted iteratively
so that the difference between the output values and the target values is
minimized. The weight values are updated by iteration as follows:

$$w_{ji}(l + 1) = w_{ji}(l) + \eta\, \delta_j\, o_i + \mu\,[w_{ji}(l) - w_{ji}(l - 1)] \qquad (5)$$

where $w_{ji}$ is the weight value from the $i$th to the $j$th neuron, $o_i$
is the $i$th element of the actual output pattern produced by an input
pattern, $\eta$ is the learning rate, $l$ is the number of epochs, $\delta_j$
is the error signal, and $\mu$ is a momentum parameter. In this study, the
learning rate $\eta$ and the momentum $\mu$ are 0.08 and 0.7, respectively. To
evaluate the network performance during the learning process, a global error
measure is given as

$$E_{RMS} = \sqrt{\frac{1}{G} \sum_{g=1}^{G} (t_g - o_g)^2} \qquad (6)$$

where $o_g$ and $t_g$ are the output value and the target value of the neural
network for the $g$th input pattern, respectively, and $G$ is the number of
training patterns. In this study, the learning process is stopped when the RMS
error, $E_{RMS}$, is less than 0.1.


3.4.  Classification  Results 


To study the efficacy of pattern classification by the jack-knife method, the
172 ROIs were partitioned arbitrarily into training and test sets, each
consisting of 86 ROIs (50 negative and 36 positive). All the textural features
were normalized by the sample mean and standard deviation of the training set.
The LABROC1 algorithm developed by Metz et al. [12] was used to fit the
outputs of the neural network obtained on the test set. The area under the ROC
curve, Az, is used as a measure of the classification performance. The optimum
number of hidden neurons was analyzed for better Az. The optimum distance d
for the SGLDM and the GLDM, and the optimum threshold q for the SRDM, were
also analyzed. Figure 2 compares the classification performances of the
texture analysis methods by means of the ROC analysis. Figure 3(a) shows the
comparison of four ROC curves at the optimal performance of each method,
obtained by the jack-knife method.

We also studied the classification performance of the texture analysis
methods by using the round-robin method. When there are D sample patterns,
this procedure trains the classifier with D-1 samples, then uses the one
remaining sample as a test sample. Classification is continued in this manner
until all D samples have been used once as a test sample. Figure 3(b) shows
the comparative result of the classification performances by the round-robin
method in terms of the ROC analysis.
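
The round-robin procedure described above can be sketched as follows. The `fit` and `score` callables are placeholders for whatever classifier is used; they are assumptions for illustration, not part of the original study.

```python
def round_robin_splits(n_samples):
    """Yield (train_indices, test_index); each sample is held out exactly once."""
    for test_idx in range(n_samples):
        train = [i for i in range(n_samples) if i != test_idx]
        yield train, test_idx


def round_robin_outputs(samples, labels, fit, score):
    """Train on D-1 samples, score the held-out one, and repeat for all D samples."""
    outputs = []
    for train, test_idx in round_robin_splits(len(samples)):
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        outputs.append(score(model, samples[test_idx]))
    return outputs
```

The list of held-out outputs, paired with the true labels, is what would then be passed to the ROC analysis.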

TABLE 1 shows the computation time required to extract features from
an ROI of 128 x 128 pixels with 12 bits per pixel. All programs were
written in C and executed on an HP workstation (715, 100 MHz). The
SGLDM was very time-consuming with 12-bit data. From the viewpoint
of classification accuracy and computational complexity, it is apparent that the
SRDM is superior to the other methods.


Fig. 2. Comparisons of the classification performances for texture analysis methods:
(a) SRDM, (b) SGLDM, (c) GLDM, (d) GLRLM. The horizontal axis of each panel is the
number of hidden neurons (3, 6, 9, 12).

Fig. 3. (a) The comparison of ROC curves at the optimal performance of each method
performed by the jack-knife method. (b) The comparison of ROC curves of each
method performed by the round-robin method. Here TPF and FPF are the true-positive
fraction and false-positive fraction, respectively, and HN denotes the number of hidden
neurons.

TABLE 1. Time required to extract features from an ROI of 128 x 128 pixels
with 12 bits per pixel on an HP workstation (715, 100 MHz).

Texture Analysis Method    Time (seconds)
SRDM                       0.65
SGLDM                      681.6
GLRLM                      8.75
GLDM                       0.3

4.  CONCLUSIONS 

The goal of this work was to find the most useful spatial-domain texture
analysis method for the detection of clustered microcalcifications on
mammograms. We compared the performance of the SRDM with that of the
three conventional texture analysis methods, using ROC analysis to evaluate
the classification performance. In spite of the limited number of cases, the
performance of the SRDM is very promising. Further investigation of the
effectiveness of the SRDM will be conducted with a large database in order to
evaluate the SRDM for real clinical use in detecting clustered
microcalcifications on mammograms.

Abstract

Our focus is to use neural networks to interactively assist in the initial
segmentation of medical imagery, through learning the characteristics of a
contour being traced and projecting ahead a trace whose initial few pixels
were specified. To date, much of this work is done manually, since automatic
techniques have yielded less than satisfactory results due to prerequisite
background knowledge and noise in the data. In our framework, the expert
interacts with the network to provide the context, and the network learns the
characteristics of the (potentially noisy) locality and continues the task
until further guidance is needed. We present here an initial application of
this approach to brain MRIs, and we discuss our initial evaluation of
neurologically-inspired preprocessing on the input pixel space. Our research
directions are discussed.


1  INTRODUCTION 

We are focusing on providing real-time learning and trace-ahead capabilities for
region definition in image analysis tasks. In current medical image analysis, the
reference standard for region delineation is an expert’s manual outlining of the
region. In certain domains, such as tumor identification, automatic delineation has
had some modest success (for example [7]). However, as Johnson et al. [4]
note: “Although image segmentation and contour/edge detections have been
investigated for quite a long time, there is still no algorithm that can automatically
find region boundaries perfectly from clinically obtained medical images. There
are two reasons for this. One is that most of the image segmentation algorithms
are still noise sensitive. The second reason is that most segmentation tasks require
certain background knowledge about the region(s) of interest.”

Our model here is that a human expert sets down the initial several pixels of
an image boundary, and a neural network continues the task by learning the local
landscape and continuing through image territory similar to that originally identified.
One characteristic of neural nets is an adaptability to noise, and thus if the initial
image territory is noisy, the network could learn to navigate through it, addressing
the first concern above. In addressing the second concern, we note that whole-scene
analysis, a straightforward task for a human expert, has proved exceedingly
difficult to automate. The expert/network combination we set forward capitalizes on
what each does best: the expert provides global perspective and context, and the
network quickly analyzes and works through similar local neighborhoods.

We have focused on neural networks as the learning mechanism due to their
very general abilities. In earlier studies, we demonstrated their facility in learning
non-linear region discriminations [1].

2  An  MRI  Application 

Figures 1 and 2 show sagittal sections of MRI data from the National Institute for
Mental Health (NIMH). This data is from an ongoing morphometric analysis at
NIMH of the brains in monozygotic twin pairs [2]. Figure 1 shows the raw MRI
image. Their initial image processing task is to subtract off the non-cerebral
material in this image, resulting in Figure 2. They currently do this manually with a
simple pixel eraser tool.


Figure  1:  Raw  data:  sagittal  MRI  section 

Figure 2: Cerebrum only, cleaned manually. (Orientation and contrast levels have
been standardized, so this will not match exactly to the raw original.)

3  Neural  Networks  for  Boundary  Tracing 

A model of our interaction scenario is illustrated on an enlarged set of pixels, shown
in Figure 3. The darkest pixels represent a trace of 120 pixels. The first few pixels
on the left were traced manually with a cursor over the image; that pixel segment
became the neural net’s exemplar from which to learn the contour and then project
the trace along the contour learned.


Figure  3:  Enlargement:  Network-traced  path  through  the  grey-scale  landscape. 

The  contour  is  learned  and  followed  by  tracking  characteristics  of  pixels  to  the 
left,  to  the  right,  and  on  the  directed  path,  as  in  Figure  4.  The  neural  net  evaluates 
possible  next  pixels  on  the  contour,  based  on  what  it  has  been  trained  on  in  the  past. 

3.1  Neural  Net  Design  Issues  -  Output  Representations 

Our initial neural network design had one output unit, providing a single value on
the range [0,1]: a low evaluation indicates a pixel is off to the left of the contour, a
high evaluation indicates off to the right, and a value near 0.5 indicates the pixel is
on the desired contour. The network learns an evaluation function that produces a
smoothly changing value as a pixel and its neighbors change from left-of-contour
values, to on-contour values, and then to right-of-contour values.

Figure 4: A path and its neighbors

An  alternative  network  design  we  studied  also  has  one  output  unit,  but  this  unit 
produces  a  low  value  for  pixels  centered  on  the  contour,  and  a  high  value  for  off- 
contour  pixels.  This  style  of  output  is  a  feature-detector  unit,  where  the  output  unit 
goes  low  on  recognizing  the  contour,  and  stays  high  in  non-contour  regions. 

3.2  Neural  Net  Design  Issues  -  Training

Our initial experiments demonstrated that the smooth-evaluation-function output unit
works well when following a gradient, or ramp edge (for example, see Figures 3
and 5). Unfortunately, this is unworkable if the network is trying to learn to follow
a thin line rather than a gradient. When following a line, the local neighborhoods
off to the left and right side of the line are similar, and since they are expected
to produce different outputs, the mapping from input to output is no longer a
function and thus cannot be learned.
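
The difficulty with thin lines can be made concrete with a one-dimensional cross-section. The sketch below is illustrative only: the grey levels, the 5-pixel window, and the slope of the smooth target are assumptions chosen for the example, not values from our experiments.

```python
BACKGROUND, LINE = 0.2, 1.0  # hypothetical grey levels


def window(profile, center, half_width=2):
    """Return the 5-pixel neighborhood centered on `center`."""
    return tuple(profile[center - half_width:center + half_width + 1])


def sev_target(offset):
    """Smooth-evaluation target: 0 far left, 0.5 on the contour, 1 far right."""
    return min(1.0, max(0.0, 0.5 + offset / 8.0))


def fd_target(offset):
    """Feature-detector target: low on the contour, high everywhere else."""
    return 0.0 if offset == 0 else 1.0


# Cross-section of a thin bright line on a uniform background.
profile = [BACKGROUND] * 21
profile[10] = LINE

left = window(profile, 6)    # four pixels to the left of the line
right = window(profile, 14)  # four pixels to the right of the line

# Identical inputs must map to different smooth-evaluation targets,
# so no single-valued function can fit both training examples.
inputs_identical = left == right
sev_conflict = sev_target(-4) != sev_target(4)
fd_consistent = fd_target(-4) == fd_target(4)
```

Here `inputs_identical` and `sev_conflict` are both true, which is the contradiction described above, while the feature-detector targets agree for the two neighborhoods.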

Exemplars of the contour for training are easy to derive, given an established
contour in the image. Over the training set, the true extension of the curve for
several pixels ahead is known, and can be added to the training set. A key issue,
though, is the generation and spacing of negative exemplars. The set of possible
extensions considered for each point needs to be looked at in the known training
set, and appropriate non-contour training values established.


3.3  Neural  Net  Design  Issues  -  Input  Representations 

There  are  a  variety  of  options  for  representing  the  input  pixel  space: 


•  raw  pixel  values 

•  filtered  inputs  (Laplacian,  Sobel, ...) 

•  masks  based  on  neurologically-inspired  models  (center-surround,  directed
gradient, ...)

Since  neural  nets  can  automatically  extract  high-order  moments  from  the  data,  it 
may  seem  best  to  just  feed  raw  neighborhood  pixel  data  into  the  network,  and  let  it 
automatically  learn  its  best  model.  This  is  the  strategy  used  by  the  path-following 
system  ALVINN  [3]. 

However,  in  our  application,  efficient  learning  is  also  an  issue,  since  one  goal 
of  our  systems  is  to  keep  pace  with  human  operators.  Appropriate  preprocessing 
of  inputs  should  be  able  to  accelerate  the  contour  learning. 
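
As an illustration of the filtered-input option, the sketch below applies the standard 3 x 3 Sobel (horizontal gradient) and 4-neighbor Laplacian kernels to one pixel neighborhood. The helper name and the toy images are hypothetical; the filters actually evaluated in our experiments are the ones described in Section 5.

```python
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

LAPLACIAN = [[0, 1, 0],
             [1, -4, 1],
             [0, 1, 0]]


def filter_at(image, row, col, kernel):
    """Correlate a 3x3 kernel with the neighborhood centered on (row, col)."""
    return sum(kernel[dr + 1][dc + 1] * image[row + dr][col + dc]
               for dr in (-1, 0, 1)
               for dc in (-1, 0, 1))


# Toy images: a vertical step edge and a constant region.
step = [[0, 0, 0, 1, 1] for _ in range(5)]
flat = [[1, 1, 1, 1, 1] for _ in range(5)]

edge_response = filter_at(step, 2, 2, SOBEL_X)   # strong response at the edge
flat_response = filter_at(flat, 2, 2, SOBEL_X)   # no response on a constant area
```

Feeding such responses to the network, instead of raw grey levels, hands it an edge measure it would otherwise have to learn from the pixel values.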

4  Early  Results 

Figure  5  illustrates  several  network-generated  traces  separating  the  cerebrum  from 
its  surrounding  tissue.  Each  was  generated  in  a  clockwise  direction.  The  network 
used  for  this  result  had  a  smooth-evaluation-function  output  and  normalized  pixel 
value  inputs. 


Figure 5: Net traced segments. (Note: this has been contrast enhanced and
lightened to better show the traces; the original is a color trace on a full-range
grey-scale image.)

The initial 25 pixels of trace 3 were used as the training set. The network
continued tracing ahead until it ran into trouble, in areas of the contour unrepresented
by the training set. The other traces represent restarts of the network-generated
tracing (without further training) in new areas of the image. The end of trace 5
shows an area where the brain boundary is indistinct, and confounded by possible
ghost structures from the MRI; the network performance is degraded by the
region’s similarity.


5  Experimental  Evaluation  of  Input  Representations 

We performed some initial experiments aimed at verifying the hypothesis that
preprocessing of inputs speeds learning. The experimental design used 3 input spaces
x 2 outputs x 2 tasks. The speed of learning was quantified by the time taken to
reach an RMS error of 0.1 from the target values.

The three input representations consisted of the raw pixel values and two
neurologically inspired models, illustrated here in Figure 6. The center-surround filter
is based on early retinal processing, while the oriented gradient represents an
oriented filter, such as those in the complex cells. These three filters evaluate to +1
when over a constant area.

Figure 6: Input filters.

The two output spaces were those discussed above: the SEV
(smooth-evaluation-function), and the FD (feature-detector).

The two tasks were 1) edge following and 2) line following. A basic graphic
was generated to minimize confounding influences of noise and texture in real
images, shown here in Figure 7.


Figure  7:  Line  and  Gradient  test  figures. 

The neural network topology consisted of one input layer with five inputs,
preprocessed as discussed above; there were 10 hidden units and one output unit
(either SEV or FD). Other parameters held constant across the runs were: the initial
training segment used; the pseudo-random network initialization; and the learning
and momentum rates.
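
A minimal sketch of such a 5-10-1 topology is given below. The logistic activation, the weight range, and the seed value are assumptions made for illustration; the fixed seed stands in for holding the pseudo-random initialization constant across runs.

```python
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_network(n_in=5, n_hidden=10, seed=0):
    """Create weights for an n_in -> n_hidden -> 1 network; last entry of each row is a bias."""
    rng = random.Random(seed)  # same seed gives the same initial weights on every run
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output


def forward(network, inputs):
    """Forward pass; returns the single output unit's value in (0, 1)."""
    hidden_w, output_w = network
    hidden_act = [logistic(w[-1] + sum(wi * x for wi, x in zip(w, inputs)))
                  for w in hidden_w]
    return logistic(output_w[-1] +
                    sum(wi * h for wi, h in zip(output_w, hidden_act)))
```

Calling `forward(init_network(), pixels)` with five preprocessed input values yields a value that can be read either as an SEV or as an FD output, depending on the targets used in training.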

5.1  Basic  Results 

The following table summarizes the full set of initial results:


                   SEV-               SEV-        FD-             FD-
                   Edge Task          Line Task   Edge Task       Line Task

Raw Pixels         traces well;       X           traces well     traces ok
                   learns very fast

Center-Surround    traces well;       X           traces well;    traces well
                   learns fast                    learns fast

Oriented Gradient  traces well        X           traces well;    traces well;
                                                  learns fast     learns fastest

Table 1: Summary of Initial Results

In one specific case, the SEV output with raw pixel inputs on the edge-following
task was the fastest learning system, reaching its learning objective in 200 epochs.
With the other two input representations, several thousand training epochs were
required. The drawback of the SEV output model, however, is that it works only on
edges, and fails miserably when learning to follow a line.

With the FD output model, both tasks could be learned well. For the
edge-following task, the center-surround and oriented-gradient filters both reached their
learning criteria in 100 epochs vs. 800 epochs for raw pixel inputs. Equivalent
improvements were evident for the line-following task.

It is interesting to note that the SEV output with raw pixel inputs was the
fastest-learning, but most specific, combination tested. In a way, they are custom
fit for each other. On an edge, the pixels on one side will all have low values and
those on the other side high values; it should be easy to learn a smooth mapping
onto [0,1] in this case.

Using the FD output model, the network could learn both tasks, and in this case
the preprocessed inputs yielded faster learning. This neurologically-inspired
combination was a more general learning mechanism, even though somewhat slower
in learning overall.


6  Research  Directions 

We are continuing to study various architectures appropriate for the task. An
additional benefit from the feature-detector style output is a quantification of a
“confusion” measure (i.e. output neither high nor low), for when the network needs to fall
back to the human expert for intervention. The ability to know when the network
is outside of its domain of expertise is a key implementation detail when adding
such automated assistance onto existing tracing tools.

The traces of Figure 5 were generated using a very local 5-pixel neighborhood
in searching ahead. This works better in a relatively noise-free domain, such as
high-resolution photography. However, when noise can interrupt a contour,
the network must have a larger perspective to continue past the noise. Some other
filters we plan to try are line-extension fields, analogous to those recently identified
in complex cells.

Additionally, there are many basic engineering decisions in selecting and/or
appropriately weighting training data, when many neighborhoods along the
contour are redundant, and a few key cases capture the essence of the contour in its
overall environment.

This tracing model shows promise. On straightforward contours, with 20
initial pixels of training data, contours can be followed continuously for hundreds of
pixels. What remains to be measured is how well this squares with a ground-truth
of an expert’s delineation, and over how complex a landscape a network can be
adequately trained.

A further extension of this work is into contour identification across the 3D
volume composed of many parallel slices. We plan to explore contour extension on
adjacent layers, without further training. And when slices are sufficiently well
registered, several traced layers could analogously be used to propagate the contour
identification to succeeding layers.

ABSTRACT

This paper explores the potential for the application of neurocomputing
technology to the domain of post-operative liver transplant monitoring. The
investigation compares a neural network model with two classical statistical
techniques using biochemical information obtained from a set of liver
transplant patients. Each approach combines the results of a number of liver
function tests to predict the presence of allograft rejection. Each system is
assessed, relative to the clinical gold standard, in terms of its overall
accuracy and degree of advance warning offered. Applying non-linear methods
does offer an advantage over the traditional linear techniques. The
underlying structure of the data set has also been determined using k-means cluster
analysis. This analysis suggests important directions for future investigation
including the use of temporal information. Preliminary results of
incorporating this temporal information are also presented.

INTRODUCTION 

Transplantation  of  the  human  liver  is  currently  the  only  viable  therapeutic 
technique  that  can  be  applied  to  patients  suffering  from  end  stage  liver  failure. 
This  procedure,  while  having  advanced  a  great  deal  since  its  inception  in  the 
early  1980s,  still  has  a  great  many  risks  associated  with  it. 

Liver  transplant  patients  must  be  maintained  on  immunosuppressive  drug 
therapy.  This  deliberately  inhibits  their  immune  system  from  detecting  and 
destroying  their  new  liver. 

Substantial advances in the transplantation arena, to date, have involved
the improved understanding of the graft rejection process and the
development of more specific immunosuppressive drugs. While the improvements
in the available drugs have allowed better control of rejection, the drugs are
still far from perfect. In particular, these drugs are still among the most toxic
prescribed to any population of patients.

Liver transplant patients are especially difficult to manage since their
absorption / metabolism of the immunosuppressive drugs is affected by
variations in the function of their new liver [1, 2]. This unpredictability leads
to difficulties in selecting the optimum dose of drug required to protect the
patient from rejection and minimise the risk from adverse side effects
associated with over-immunosuppression. This requires the clinical staff to monitor
the patient frequently and make adjustments to their drug regime based on
information gleaned from a number of biochemical and haematological tests.
These tests include liver function tests (LFTs) and blood cell type/count
analyses designed to indicate whether the patient’s immune system is
invoking some response (either fighting infection or rejecting the new liver). More
subjective, clinical, indicators include the general condition of the patient
and their speed of recovery from surgery.

Many  of  the  modifications  to  the  drug  therapy  are  made  in  response  to 
suspected  or  confirmed  rejection  of  the  liver.  The  presence  of  a  rejection 
process  may  not  be  obvious  from  the  available  biochemical  data  until  the 
rejection  episode  is  relatively  well  advanced.  Thus,  dose  changes  may  be 
relatively  late  and  considerable  cellular  damage  may  already  have  occurred. 

The risk of rejection is at its greatest in the initial period following
transplantation. At this early stage, the clinical staff are concerned primarily with
preventing graft rejection and they maintain the patient on a higher level of
immunosuppression. Consequently, adverse side effects are most common
at this time. This initial concern with the avoidance of rejection has
motivated our research to focus on predicting impending rejection at an earlier
stage. This task forms an integral part of the final goal of assisting with the
determination, day to day, of the appropriate dose of immunosuppression.

The  primary  objective  of  this  study  has  been  to  explore  the  feasibility 
of  using  neurocomputing  to  assist  with  post-operative  monitoring  of  liver 
transplant  patients.  Applications  include  categorising  the  types  of  risk  that 
can  affect  these  patients  and  using  dynamically  acquired  and  historical  data 
to  assess  the  impact  that  such  risks  will  have  on  their  short  and  long  term 
management. 

Other  researchers  [3]  have  explored  the  use  of  connectionist  techniques  in 
this  domain  but  have  used  the  technology  in  a  static  mode  which  predicts 
long  term  graft  survival  based  on  pre-operative  and  early  post-operative  data. 
In  our  application,  post-operative,  on-line,  data  is  analysed  as  and  when  it 
becomes  available.  This  research  explores  the  niche  for  increasing  the  degree 
of  advance  warning  of  impending  rejection  available  to  the  clinical  staff  so 
that  pre-emptive  modifications  to  the  immunosuppressive  drug  therapy  may 
be  made. 

In this paper we compare three different classification systems in the
performance of this type of task: logistic regression (LR), linear discriminant
analysis (LDA) and multi-layer perceptrons (MLP). The objective has been
to determine the degree of advantage, relative to the well known statistical
techniques, offered by the neurocomputing approach.

Each of these systems was designed to take the LFT results as input and
yield, as output, a measure of the risk of rejection. The performance of each
system was evaluated in terms of two criteria:

•  Predictive  accuracy  for  rejection.

•  Ability  to  offer  earlier  warning.

Biochemical Test Information

Test Name   Class             Ref. Range   Units       Half Life
GST         Liver Enzyme      <10          µg L⁻¹      < 1 hour
ALP         Liver Enzyme      30-135       U L⁻¹       40 hours
ALT         Liver Enzyme      0-50         U L⁻¹       47 hours
BILI        Liver Excretion   0-17         µmol L⁻¹    Variable

Table 1: Selected Biochemical Test Information.

The techniques used to perform these evaluations on the systems are
described in the later sections of this paper.

AVAILABLE  DATA  SOURCES 

A  database  was  constructed  using  information  collected  by  following  the  test 
results  of  95  liver  transplants  in  a  total  of  80  patients  (15  re-transplants 
occurred).  Their  biochemical  and  haematological  status  was  collected  for  up 
to  100  days  following  transplantation. 

This paper focuses on a specific subset of this database. A more detailed
account of the full database structure and content together with the results
of a number of other analysis techniques is available in [4].

The database was refined to focus on a few frequently used LFTs,
specifically GST¹, ALP¹, ALT¹ and BILI². The numeric values for each of these tests
were extracted to form a number of 4-dimensional feature vectors (or frames),
each corresponding to the measurements on a patient on a given day. Each frame
was augmented with a tag which indicated whether or not the data originated
from a patient who was undergoing a rejection³. This refinement process yielded
3206 data frames (56 reject, 3150 non-reject). Figure 1 shows an example
of the profile followed by these four LFTs over the course of a single
patient’s post-operative period. Tables 1 and 2 outline the general biochemical
characteristics of these four tests. A detailed description of the mode of
operation of these tests is available in [4]. In summary, these tests detect liver
damage by measuring the amount of enzyme released from damaged cells
(liver enzyme) and the degree of blockage of the liver bile drainage system
(liver excretion). The range of values exhibited by a population of healthy
volunteers is shown in Table 1 as the reference range. The range exhibited
by our liver transplant patients is shown in Table 2. The distribution of the
test values is also known to be log-normal [4] and consequently the data has
been transformed logarithmically to account for this.

¹ Serum concentration of α-Glutathione S-transferase (GST). Serum Alkaline
Phosphatase (ALP) and Alanine Transaminase (ALT) activities.

² Serum bilirubin concentration (BILI).

³ In line with current clinical practice, only biopsy confirmed rejections were used to
identify rejection.

Biochemical Test Statistics

                       ALP     ALT     GST      BILI
Reject       max       1720    1940    1140     479
             min       63      44      3.9      17
             median    282     249.5   25.5     144.5
Non-Reject   max       3890    9440    47250    993
             min       23      4       0.1      2
             median    224     112     10       48

Table 2: Selected Statistics of the Biochemical Tests.

Figure 1: Example profile of LFTs over time. Two rejections occur in this
example denoted by the dots (at y=1000).
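
The logarithmic transformation of a data frame can be sketched as below. The natural logarithm is an assumption (the base is not stated here), and the example frame simply reuses the reject-class medians from Table 2 in the order ALP, ALT, GST, BILI.

```python
import math


def log_transform(frame):
    """Apply a natural-log transform to each test value in a frame.

    All values must be strictly positive, which holds for the ranges in Table 2.
    """
    return [math.log(value) for value in frame]


# Reject-class medians from Table 2: ALP, ALT, GST, BILI.
median_frame = [282.0, 249.5, 25.5, 144.5]
transformed = log_transform(median_frame)
```

After this step, values spanning several orders of magnitude (for example GST, from 0.1 to 47250) are compressed onto a scale better suited to the linear and neural models.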

The frame set, described above, was segmented by a random selection
process. This procedure chose a number of frames from each of the two classes
(reject/non-reject) to create a training set. A random 25% portion of the data
was excluded for use in testing. This procedure was repeated to yield 100
pseudo-random training and test data sets for use in constructing and evaluating
each of the classification systems.
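
One way to generate such repeated splits is sketched below. Holding out 25% within each class separately, and seeding each split with its index, are assumptions made for the illustration; the function names are hypothetical.

```python
import random


def split_indices(labels, test_fraction=0.25, seed=0):
    """Return (train, test) index lists, holding out test_fraction of each class."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = int(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return train, test


def make_splits(labels, n_splits=100):
    """Generate n_splits pseudo-random train/test partitions."""
    return [split_indices(labels, seed=s) for s in range(n_splits)]


# Class sizes from the text: 56 reject frames and 3150 non-reject frames.
labels = [1] * 56 + [0] * 3150
splits = make_splits(labels)
```

Each model can then be fitted on the training indices of a split and scored on its test indices, giving a distribution of 100 results per model.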

MODEL  DESCRIPTIONS 

The task of interest here can be viewed as a multivariate two-class prediction
problem, the independent variables being the selected liver function tests
(as provided by the data frames described above) and the dependent variable
the risk of rejection in the immediate future. All of the models documented
here apply this as the high-level strategy, with the individual solutions varying
only in their implementation approach.

The sensitivity of each model, to the data used in its construction, has
been assessed by computing the distribution of the results obtained from the
model over the 100 training/test data sets.

The regression parameters for the logistic regression model were
computed on the training segments of the 100 data sets and analysed using the
corresponding test segments.

The linear discriminant analysis model was based around a Fisher
transformation [5, 6] of the 4D data space into a single discriminant dimension.
Using this discriminant projection it was possible to compute a single value for
any 4D data vector. Defining a threshold value for this parameter allows
a classifier to be constructed which reports rejection if the fixed decision
threshold is exceeded.
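
The projection and thresholding step can be sketched as follows, using the standard Fisher direction (the inverse within-class scatter applied to the difference of class means). This is a generic illustration with hypothetical function names, not the code used in the study, and it assumes the within-class scatter matrix is non-singular.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def scatter_matrix(rows, mean):
    """Sum of outer products of the deviations from the class mean."""
    d = len(mean)
    s = [[0.0] * d for _ in range(d)]
    for row in rows:
        diff = [x - m for x, m in zip(row, mean)]
        for i in range(d):
            for j in range(d):
                s[i][j] += diff[i] * diff[j]
    return s


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fisher_direction(reject, non_reject):
    """Return w = Sw^-1 (m_reject - m_non_reject)."""
    m1, m0 = mean_vector(reject), mean_vector(non_reject)
    s1, s0 = scatter_matrix(reject, m1), scatter_matrix(non_reject, m0)
    sw = [[a + b for a, b in zip(r1, r0)] for r1, r0 in zip(s1, s0)]
    return solve(sw, [a - b for a, b in zip(m1, m0)])


def project(w, frame):
    return sum(wi * xi for wi, xi in zip(w, frame))


def reports_rejection(w, frame, threshold):
    """Report rejection when the discriminant value exceeds the threshold."""
    return project(w, frame) > threshold
```

The same functions work unchanged for 4D frames; sweeping `threshold` over the range of projected values produces the operating points of an ROC curve.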

The architecture used for the multi-layer perceptron approach was a fully
connected feed-forward network with four input units, five hidden units and a
single output unit. The hidden and output units have a logistic transfer
function. A small number of hidden nodes was chosen to mitigate over-fitting of
the relatively small data set. The network was trained on a balanced version
of the data set having the “reject” class artificially expanded by duplication
of the existing members of this class. The data was pre-processed to give
the data zero mean and unit variance. The parameters used to perform this
preprocessing were calculated on the training data but were stored to allow
transformation of unseen/test data.
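
The two preprocessing steps, class balancing by duplication and standardisation with stored training statistics, can be sketched as below. The duplication count and the use of the population standard deviation are assumptions made for the illustration, and a feature with zero variance is not handled.

```python
import math


def balance_by_duplication(frames, labels, minority=1):
    """Append whole copies of the minority-class frames until the classes are roughly equal."""
    minority_frames = [f for f, y in zip(frames, labels) if y == minority]
    n_majority = len(frames) - len(minority_frames)
    extra_copies = max(0, n_majority // len(minority_frames) - 1)
    out_frames = list(frames) + minority_frames * extra_copies
    out_labels = list(labels) + [minority] * (len(minority_frames) * extra_copies)
    return out_frames, out_labels


def fit_scaler(train_frames):
    """Compute per-feature mean and standard deviation on the training data only."""
    n = len(train_frames)
    columns = list(zip(*train_frames))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n)
            for col, m in zip(columns, means)]
    return means, stds


def apply_scaler(frames, scaler):
    """Transform frames (training or unseen) with the stored training statistics."""
    means, stds = scaler
    return [[(v - m) / s for v, m, s in zip(frame, means, stds)]
            for frame in frames]
```

Fitting the scaler once on the training frames and reusing it for test frames ensures that no information from the test data leaks into the preprocessing.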

RESULTS  AND  ANALYSIS 
Analysis  Techniques 

This section makes considerable use of the ROC [7, 8] analysis technique.
This technique is frequently used to compare the performance of two-class
decision systems. In particular, this paper uses the technique to directly
compare the capabilities of the various classifiers applied to the liver
transplant data. More specifically, both the shape of and the area [9] under
the ROC curve are used to show the differences in performance of the liver
transplant classification systems. The ROC curves show both the strength of
the system at detecting the rejections (sensitivity) and the degree to which
the system is specific to rejection and tolerant of other phenomena
(specificity). Each point on an ROC curve denotes these two quantities
(expressed as rates) for a chosen threshold value for the output of the
classifier. Specifically, the ROC curve is formed by selecting a series of
threshold values for the output of the system (e.g. (output > 0.5) ->
rejection) and evaluating how sensitive and specific the system is for a test
data set. An optimal ROC curve is one where the test is highly specific
((1 - specificity) = 0) and highly sensitive (sensitivity = 1); it leads to a
“square” ROC curve which runs from (0,0) to (0,1) to (1,1). The line shown on
the ROC curves, denoted “Reference”, represents the result of a system which
is random.
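
The threshold sweep described above can be sketched as follows. The scores, labels and threshold values are made up for illustration and the function name is hypothetical; label 1 stands for rejection and 0 for non-rejection.

```python
def roc_points(scores, labels, thresholds):
    """Return one (1 - specificity, sensitivity) pair per threshold.

    An example is classed as "rejection" when its score exceeds the threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for th in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s > th and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > th and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical classifier outputs for five test examples.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0]
curve = roc_points(scores, labels, [1.0, 0.5, 0.35, 0.2, 0.0])
```

A threshold above every score gives the point (0,0) and a threshold below every score gives (1,1), so the sweep traces the curve between those two corners.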

The area under the ROC curve can be used as a direct comparison measure.
It can be interpreted as the probability of correctly ordering a pair of
examples (one rejecting and one non-rejecting). Thus the closer the area
under the ROC curve is to 1.0, the better the classification system is at
performing its task.
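
The pair-ordering interpretation can be computed directly, as in the sketch below. The data are hypothetical, and counting tied scores as half a correct ordering is a common convention assumed here rather than something stated in the text.

```python
def auc_pairwise(scores, labels):
    """Fraction of (reject, non-reject) pairs in which the reject example
    receives the larger score; ties count as half (an assumed convention)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))

# Hypothetical outputs: three rejecting and two non-rejecting examples.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0]
area = auc_pairwise(scores, labels)   # 5 of the 6 pairs are ordered correctly
```

A classifier that separates the two classes perfectly orders every pair correctly and gives an area of 1.0, while a random system gives an area near 0.5.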

The ROC curve and area for each of the models are given below. Specifically,
example ROC curves are supplied which show the performance of the model
for a single data set. Alongside this, histograms of the area under the ROC
are supplied. These show the distribution of the areas observed over the 100
training/test data sets. A similar histogram is supplied which depicts the
area as a function of advance warning level.

Logistic  Regression  Models 

Figure 2a depicts the general performance of the logistic regression based
classifier. It can be seen that the classifier is noticeably better than
random but that it is still far from optimal.

Figure 3a summarises the performance of this model over the available
data sets and confirms that, in general, there is an approximately 75% chance
of correctly ordering a random reject/non-reject pair (i.e. the classifier
produces a larger output value for the reject example).

Figure 3b shows the performance decay, towards random, of the model
as greater advance warning is expected. Specifically, this shows the result
of assessing the model on data obtained on days preceding the actual
biopsy-confirmed rejection. Figure 2b summarises this advance warning
performance over the available random data sets.

The wide variation in the zero warning histogram can be attributed to the
limited number of data points available in the testing set. Thus for a small
change in the threshold value used in the classifier, large changes can occur
in the sensitivity and specificity rates. This phenomenon combines with the
variations induced by the different training sets to produce a considerably
larger spread of areas under the ROC curve.

Discriminant  Analysis  Models 

The result of sweeping the decision threshold through a range of values and
computing an ROC for an example LDA model is shown in Figure 2c. It can
be seen that this system is also significantly better than a random classifier
but less than optimal.

Figure  2d  depicts  the  distribution  of  areas  for  the  different  data  sets.  It 
can  be  seen  that  there  is  a  marginal  improvement  in  performance  obtained 
from  the  LDA  model  (relative  to  the  LR  model)  with  a  slightly  heavier  right 
tail.  This  marginal  improvement  is  also  reflected  in  the  advance  warning 
performance.  It  can  however  be  seen  that  there  is  a  rapid  degradation  in  the 
performance  of  the  classifier  as  the  degree  of  warning  is  increased  (i.e.  the 
area  under  the  curve  tends  rapidly  to  0.5).  The  speed  with  which  it  degrades 
is  of  considerable  importance  in  terms  of  developing  a  system  which  can  yield 
an acceptable level of advance warning.

[Figure 2 panels: (a) Example logistic regression (LR) classifier ROC.
(b) Histograms of area under the ROC (LR model). (c) Example linear
discriminant analysis (LDA) classifier ROC. (d) Histograms of area under
the ROC (LDA model). (e) Example multilayer perceptron (MLP) classifier
ROC. (f) Histograms of area under the ROC (MLP model).]

Figure 2: Classifier performance expressed both as Receiver Operating
Characteristic (ROC) curves and histograms of the area under the curve
computed for the different cross validation sets. The histograms also depict
the advance warning performance of each system.

[Figure 3 panels: (a) Distribution of the area under the ROC curve for
different cross validation data sets (both test and training). (b) Logistic
regression model performance in terms of providing advance warning.]

Figure 3: Generalisation and advance warning performance for the logistic
regression model.

Multi-Layer  Perceptron  Models 

The performance of an MLP system is shown in Figure 2e. It can be seen that
there is an enhanced performance relative to the more traditional LR and
LDA models. However, none of the systems investigated, including the MLP,
shows optimal performance. It can also be seen that the MLP model performs
marginally better on the training set than on the test set. This implies that,
at least with the example shown in Figure 2e, there is a small degree of
over-fitting of the data in the MLP model. Nevertheless, Figure 2f shows that
the MLP still performs better when assessed with the cross validation
approach.

CONCLUSION  AND  FUTURE  DIRECTIONS 

This paper presented some preliminary results of applying neurocomputing
technology to predicting allograft rejection in liver transplant recipients.
Posing the problem as a classification task, it provided a comparison of
classification performance (in terms of overall accuracy and degree of
advance warning) of a MLP based system relative to traditional statistical
approaches. The liver function tests, taken in combination, do seem to carry
useful extra information capable of predicting rejection two or three days
ahead in time. The non-linear neural network classifiers outperform the
classical linear approaches for this task.

The non-optimal behaviour of these preliminary models can be partly
attributed to the overlap in the distributions of the data. Considerable
overlap between the reject and non-reject distributions can be observed when
visualising the data in 1 or 2 dimensions (see Figure 4a). This overlap is
attributable to variation in severity of rejection, to some degree of
mis-labelling of the data forming the non-reject class and inter-patient
variability. K-means [10] clustering has been applied to this database in an
effort to identify structure in the data and any potential mis-labelling.
Table 3 shows the result of splitting the data into two clusters. The
clinical reference ranges listed in Table 1 compare well with the values
selected by the k-means algorithm for class 1. This class contains very few
examples of measurements associated with rejection. Class 2, conversely,
shows elevated values for the LFTs and is responsible for the majority of
vectors associated with rejection. However this class also contains a
substantial number of members which were originally labelled as
non-rejecting. A sizeable proportion of these samples occur in the early
post-operative period and are not associated with rejection but with damage
caused during the transplant procedure. Including trend information allows
these early measurements to be interpreted more appropriately since initially
there is a decline in the test values consistent with normal clearance.

[Figure 4 panels: (a) Scatter plot of the reject/non-reject data on the 2D
Fisher plane. (b) Trajectories across the Fisher plane.]

Figure 4: Distribution of reject/non-reject data on the Fisher plane and the
trajectory followed by data prior to rejection episodes.

Cluster Centres

Cluster     GST     ALT     ALP   BILIRUBIN    Nr    Nnr
1           6.4    69.7   144.0        28.6     6   1705
2          25.9   234.5   392.3       132.4    60   1758

Table 3: Cluster centres for k=2 in original units. Nr and Nnr columns
show the number of vectors in each cluster that are marked as rejecting and
non-rejecting in the original labelling.
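
The two-cluster analysis can be sketched in a few lines. This is an illustrative version of the standard k-means (Lloyd) iteration followed by a cross-tabulation against the original labels; the data, the initial centres and the function names are hypothetical and are not taken from the liver transplant database.

```python
def kmeans(points, centres, iterations=20):
    """Plain k-means: assign each point to its nearest centre, then move
    each centre to the mean of its members. Returns (centres, assignments)."""
    assign = [0] * len(points)
    for _ in range(iterations):
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centres]
            assign[i] = dists.index(min(dists))
        new_centres = []
        for k, c in enumerate(centres):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                new_centres.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centres.append(list(c))   # keep an empty cluster's centre
        centres = new_centres
    return centres, assign

def cross_tab(assign, labels, k):
    """Per cluster, count vectors labelled rejecting (Nr) and non-rejecting (Nnr)."""
    table = [[0, 0] for _ in range(k)]
    for a, y in zip(assign, labels):
        table[a][0 if y == 1 else 1] += 1
    return table

# Hypothetical 2D measurements with their original labels (1 = rejecting).
points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]
labels = [0, 0, 1, 1, 1, 0]
centres, assign = kmeans(points, [points[0], points[3]])
counts = cross_tab(assign, labels, 2)   # rows are clusters, columns are [Nr, Nnr]
```

Vectors whose cluster disagrees with their original label are the candidates for mis-labelling or, as noted above, for early post-operative effects.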

Figure 4b shows the trajectory followed by a number of the patients as
they approach a rejection episode (only patients having 5 consecutive
measurements leading up to rejection are displayed). The solid lines identify
those examples where the patient followed a consistent trend towards the
rejecting population centre. The dotted lines show the small number of
examples where they did not.

Principled techniques for the inclusion of this trajectory information are
currently under investigation together with techniques for estimating values
for missing data.

Abstract


Implantable cardioverter defibrillators (ICD) administer high voltage shock
therapies to terminate dangerous cardiac arrhythmias. Improving the
functionality of these devices to include on-line diagnosis based on
Intracardiac Electrogram (ICEG) morphology and to log dangerous signals is
important for their more widespread use. It is essential that the ICD
implement a signal compression scheme due to the limited memory in the
device. We have fitted gaussian mixture models to the ICEG signals in order
to investigate to what extent non-linear data models are advantageous in this
application compared to the traditional linear approaches used in the field,
and to explore the common features between classification and compression.
Results of fitting the mixture models show that typically a single gaussian
per class for classifiers and single gaussian prediction models for data
compression are adequate data representations provided the data is
preprocessed to remove non-stationary behaviour.


1  INTRODUCTION 

This paper develops a framework for establishing performance bounds for
ICEG classification and compression systems. The requirements for
compression and classification are different in emphasis but share some
common features. For the case of arrhythmia classification via morphology
analysis (which improves diagnosis for some arrhythmia over systems which
rely on heart rate alone), the states of the heart are being distinguished by
trying to determine which type of signal the heart is producing given some
previous knowledge about what types of signals are produced under both normal
and abnormal conditions. For the case of ICEG data compression, a concise
representation of the signal is sought, given knowledge of the signal in the
past, but so as to retain a great deal of the detail in the signal. Thus,
morphology classification is an extreme form of compression that retains only
a labelling of the signal, whereas compression seeks to retain the
diagnostically significant part of the local structure. In this paper we
demonstrate the advantage of common scale and shift invariant preprocessing
for these two applications and then, using this representation, model the
transformed signals with mixtures of gaussians. Gaussians are a good choice
for real valued data as they render a model which may be interpreted in terms
of well known parameters and may be easily manipulated to obtain
relationships between variables. The fitting of a mixture of gaussians is
described in [McLachlan and Basford, 1987].

Section 2 introduces the modelling steps and discusses in detail those steps
common to both classification and compression. Section 3 describes the
estimation procedure and bounds for ICEG morphology classifiers. The
classifier bounds include both the case of separating NSR (normal heart
rhythm) from VT 1:1 (dangerous rhythm) with examples of both available and
the blind separation of NSR from VT given only examples of NSR. Section 4
describes the estimation procedure and bounds for ICEG data compression.

2  SOURCE  MODELLING  OF  THE  ICEG 

The modelling procedure adopted is described by the following steps:

•  Segmentation: The ICEG time series is segmented into “QRS complexes”,
which are 30 samples wide and contain the diagnostic information (see
Figure 1).

•  Identification: The non-stationary behaviour of the signal is removed
and an initial model is selected.

•  Estimation: The model is fitted using the Expectation Maximisation
(EM) algorithm [Dempster et al., 1977].

The model identification steps will now be described as these form the
common preprocessing steps for both classification and compression. The
description of the estimation steps is deferred to the respective sections on
classification and compression.
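
As background for the estimation step, the following is a minimal sketch of EM for a one-dimensional mixture of gaussians. It is not the fitting code used in the paper: the data, the initial parameters, the fixed iteration count and the small variance floor are all assumptions made for illustration.

```python
import math

def em_gmm_1d(data, weights, means, variances, iterations=50, var_floor=1e-6):
    """EM for a 1D gaussian mixture; returns updated (weights, means, variances)."""
    weights, means, variances = list(weights), list(means), list(variances)
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            p = [w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate each component from its weighted data.
        for j in range(len(means)):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, var_floor)   # assumed guard against collapse
    return weights, means, variances

# Hypothetical data drawn around -2 and +2.
data = [-2.1, -2.0, -1.9, 1.9, 2.0, 2.1]
weights, means, variances = em_gmm_1d(data, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

Each iteration cannot decrease the likelihood of the training data under the model, which is why the fitted model can be scored directly by its log probability on training and test sets.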

Figure 1: The morphology of NSR and VT retrograde 1:1. The letters QRS are
conventional notation to label segments of the signal which correspond to
those times when the heart is actually pumping blood. Hence, this period is
often referred to as the QRS complex.

The ICEG is a periodic non-stationary signal. There are several mechanisms
for its non-stationary behaviour: the ICEG is a measure of the heart pumping
blood in a normally synchronous manner which leads to periodic (seasonal)
non-stationarity; the ICEG is subject to longer term influences such as daily
cycles in metabolism, exercise, tissue growth, disease and aging. For
recordings¹ lasting from minutes to hours only the short term non-stationary
behaviour is relevant. Due to segmentation, the variance in the period is
reduced to that of the QRS complex detector which is about 3 sample
intervals. The effect of variation in the R point detection (detection of the
signal peak) can be removed by calculating the correlation of the current
complex with the previous. The non-stationarity in the mean of the signal may
be removed by the standard method of periodic differencing. Stationarity in
the mean is indicated when the auto-correlation function quickly decays to
small values after a few lags. The dashed line in Figure 2(c) shows the
effect of periodic differencing on the ICEG. Figure 2(c) (solid line) shows
evidence of a residual signal with a randomly varying amplitude. The origin
of this residual non-stationarity is the A/D converter sampling the ICEG
asynchronously resulting in “sample jitter”. This jitter process can be
described by the subsequent QRS complex segment y(t) being identical to a
complex x(t) but with a random phase shift $\phi$. Expressing y(t) as a
Taylor expansion of x(t) to the first derivative term around t,

$$y(t) = x(t) + \phi\, x'(t) + O(\phi^2) \qquad (1)$$

Thus, the residual non-stationary component in Figure 2(c) is identified as
x'(t) modulated by the random phase variable $\phi$. The Taylor expansion
to the first derivative term is an adequate approximation because x(t) is
sampled well within the Nyquist criterion to avoid aliasing. The random phase
$\phi$ is estimated by calculating the correlation of the periodically
differenced sequence with the derivative of the previous beat. Hence, the
jitter free seasonally differenced sequence is given by,

$$\Delta y(j) = y(j) - x(j) - \phi\,(x(j+1) - x(j-1))/2 \qquad (2)$$

¹ Data in this paper is recorded in hospitals. Arrhythmia are induced
artificially. Long recordings and natural arrhythmia are hard to get due to
lack of field recording capability.

Figure 2: The effect of amplitude and phase locking to the previous beat.
(a) The original segmented ICEG and (b) the autocorrelation function. (c) The
residual ICEG after amplitude and phase locking (solid line) compared to the
periodic differenced ICEG (dashed line) and (d) the resulting autocorrelation
function. The horizontal lines are 95% bounds for a white noise sequence.

This procedure may be enhanced by doing scaled periodic differencing instead
of just periodic differencing in order to remove amplitude variations between
cycles. This can be achieved by calculating a scaling factor A, from which
the jitter free scaled seasonal difference,

$$\Delta y_s(j) = y(j) - A\,x(j) - \phi A\,(x(j+1) - x(j-1))/2 \qquad (3)$$

is obtained. Figure 2(d) shows the effect of Equation 3 on the autocorrelation
function and is used for compression. Similarly, Equation 2 can be used to
phase lock complexes for jitter free classification.

Figure 3: Learning curves for a classifier. E is the expectation of the
classifier error over all possible training and test sets. l is the size of
the data sample. As l becomes large the training and testing errors converge
to a common value $E_\infty$. Above a certain training set size $l_c$ the
training and testing curves can be modelled with power-law decays.
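
Amplitude and phase locking to the previous complex can be sketched as below. The text does not give explicit formulas for the scaling factor A or the phase estimate, so this sketch assumes simple least-squares projections for both; the synthetic complexes and all function names are hypothetical.

```python
import math

def central_diff(x):
    """Derivative estimate (x[j+1] - x[j-1]) / 2, with end points set to zero."""
    d = [0.0] * len(x)
    for j in range(1, len(x) - 1):
        d[j] = (x[j + 1] - x[j - 1]) / 2.0
    return d

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def locked_difference(x, y, scaled=True):
    """Residual of the scaled, jitter free difference (or the unscaled one
    when scaled=False), with A and phi obtained by least squares (assumed)."""
    dx = central_diff(x)
    A = dot(y, x) / dot(x, x) if scaled else 1.0
    diff = [yj - A * xj for xj, yj in zip(x, y)]
    phi = dot(diff, dx) / (A * dot(dx, dx))
    residual = [dj - phi * A * dxj for dj, dxj in zip(diff, dx)]
    return residual, A, phi

# Hypothetical 30-sample complexes: y is x scaled by 1.2 and shifted by 0.3 samples.
x = [math.exp(-((j - 15) / 4.0) ** 2) for j in range(30)]
y = [1.2 * math.exp(-((j + 0.3 - 15) / 4.0) ** 2) for j in range(30)]

residual, A, phi = locked_difference(x, y)
plain = [yj - xj for xj, yj in zip(x, y)]   # ordinary periodic differencing
```

On this synthetic pair the locked residual carries far less energy than the plain periodic difference, which is the effect the solid and dashed traces are meant to contrast.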

The remaining identification step is to determine the model order. For
classification purposes the 30 dimensional phase locked complexes comprise
the classifier inputs. For compression purposes, a choice about the size of
the memory in the model needs to be made. The autocorrelation function after
amplitude and phase locking to the previous complex shows that there is
significant correlation in the first 4 lags. This suggests an autoregressive
model for the time series with 4 AR coefficients corresponding to the 4 lags.
A preliminary study on a small number of patients showed that a long term
component in the model did not contribute significantly to a lower entropy
estimate. Hence, short term model orders in the range 1 to 4 only were
considered. This determines the vector for which the distribution is to be
estimated as

X  =  — (4) 

where  x  slides  over  the  time  series  and  m  is  the  short  term  model  order  and 
M  is  the  model  capacity  (number  of  basis  functions  used). 
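
The way the autocorrelation function is used to choose the short term model order can be illustrated with a small sketch. The bound of 1.96 divided by the square root of the series length is the usual large-sample approximation to the 95% limits for white noise and is assumed here; the test series is a synthetic first order autoregressive process, not ICEG data.

```python
import math
import random

def autocorrelation(series, max_lag):
    """Sample autocorrelation r[0..max_lag] of a series."""
    n = len(series)
    mu = sum(series) / n
    dev = [v - mu for v in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
            for k in range(max_lag + 1)]

def significant_lags(series, max_lag):
    """Lags whose autocorrelation lies outside the approximate 95% white-noise bounds."""
    bound = 1.96 / math.sqrt(len(series))
    r = autocorrelation(series, max_lag)
    return [k for k in range(1, max_lag + 1) if abs(r[k]) > bound]

# Synthetic AR(1) series with coefficient 0.8 (hypothetical stand-in for a residual).
rng = random.Random(0)
series = [0.0]
for _ in range(1999):
    series.append(0.8 * series[-1] + rng.gauss(0.0, 1.0))

r = autocorrelation(series, 8)
lags = significant_lags(series, 8)
```

The number of leading lags that exceed the bound suggests how many past samples the model should condition on.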

3  CLASSIFIER  PERFORMANCE  BOUNDS 

3.1  Model Estimation for Classifiers

A classifier is determined by both its structure and the number of free
parameters. Learning curves are defined as the expectation of the classifier
error over all equally sized training and testing sets that may be randomly
selected from the data sample of size l. In [Cortes et al., 1994] it was
shown that these learning curves can often be modelled by power-law decays to
obtain an estimate of the asymptotic performance of the classifier. An
illustration of these learning curves is depicted in Figure 3. In [Cortes et
al., 1995] it is demonstrated that these asymptotic estimates may then be
used to bound the performance of any classifier on that data by considering
the asymptotic classifier error as the capacity is increased. An illustration
of an asymptotic classifier error curve is shown in Figure 4.

Figure 4: Example of how the asymptotic classifier error behaves as a
function of classifier capacity. Eventually, a capacity is reached beyond
which negligible improvement in classifier performance is achieved.
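
One way to extract an asymptotic error from learning curves is sketched below. It assumes the power-law forms E_test = a + b / l^alpha and E_train = a - c / l^alpha with a common exponent, which is a simplifying assumption of this sketch; the numbers are synthetic and the function names are hypothetical.

```python
import math

def linfit(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def asymptotic_error(sizes, e_train, e_test):
    """Estimate the asymptote a and the exponent alpha from learning curves.

    The difference E_test - E_train decays as l^(-alpha), and the sum
    E_test + E_train is linear in that difference with intercept 2a.
    """
    diff = [te - tr for tr, te in zip(e_train, e_test)]
    total = [te + tr for tr, te in zip(e_train, e_test)]
    slope, _ = linfit([math.log(l) for l in sizes], [math.log(d) for d in diff])
    _, intercept = linfit(diff, total)
    return intercept / 2.0, -slope

# Synthetic learning curves generated from the assumed power laws.
sizes = [50, 100, 200, 400, 800]
e_test = [0.05 + 0.4 * l ** -0.7 for l in sizes]
e_train = [0.05 - 0.3 * l ** -0.7 for l in sizes]
a_hat, alpha_hat = asymptotic_error(sizes, e_train, e_test)
```

With measured curves the fit would be restricted to training set sizes above the point where the power-law behaviour sets in.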

3.2  Bounds  for  NSR  and  VT  1:1  Classification 

The efficacy of a single decision classifier may be expressed in terms of the
expected fraction of false negative and false positive detections. Here we
use a classifier which derives the classification from the probability
density model of the data. This more general representation provides a
natural extension to a blind classification scheme which detects the presence
or absence of NSR. A mixture of gaussians density model and the EM algorithm
may be used to model labelled data using a procedure similar to that
described in [Ghahramani and Jordan, 1994]. In order to fit the best possible
model, an evaluation of the appropriate number and form of the gaussians
(capacity) per class is required. These issues could be evaluated using
learning curves or other statistical tests [McLachlan and Basford, 1987]. As
a starting point, a simple model was chosen, consisting of one spherically
symmetric gaussian per class. Such a simple model will provide an upper bound
for the classification error rate.

Table 1 summarises the misclassification bounds computed using the simple
spherical gaussian per class. The % correct columns are averages over 10
different training and testing sets constructed by randomly splitting the
sample sets in two and then refitting the density model. The sample complexes
are sequentially phase locked to each other prior to fitting the models. The
table shows that all patients indicate a misclassification rate of less than
1%. Since the misclassification rates for the test sets are already very low,
no attempt was made to obtain a tighter bound by considering the convergence
properties of the training and testing error rates.

Complexes

%  Correct 
training 

%  Correct 
testing 

%  Correct 
blind  testing 

VT 

mmm 

kh 

VT 

BUBI 

Qin| 

61 

99.6 

100 

98.9 

100 

inti 

71 

100 

100 

1  M 

BBitl 

99.4 

100 

177 

75 

100 

99.5 

B 

99.7 

97.7 

100 

no 

71 

100 

100 

B  ^ 

100 

99.3 

100 

38 

90 

100 

100 

100 

98.9 

100 

Table 1: Estimate of misclassification bounds for 5 patients with VT 1:1
retrograde conduction for both classification with training on the VT data
and without training on VT (blind).


Bounds for a classifier which only has a measure of the NSR probability
density may be computed by considering only the density estimation of the
NSR data and assigning an arbitrary probability threshold for considering a
data point to be a member of the NSR class. In this case the training set
misclassification error is predetermined by the acceptable level of false
positive detections. Testing the model on NSR only then indicates the
adequacy of the model as an estimate of the probability density of the NSR
complexes at that error rate. By requiring the training error rate to be
zero, the least probable training vector determines a probability threshold
for NSR membership. The NSR testing set then gives an upper bound on the
resultant false positive error rate. By then testing the model against VT
with the same threshold, the sensitivity to VT is determined. Table 1 shows
that high classification performance is predicted with a single spherical
gaussian modelling the NSR density. With a maximum of 2% false positive error
rate the simple density model appears adequate and hence more complex models
were not considered.
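
The blind scheme can be sketched as follows: a single spherical gaussian is fitted to NSR vectors only, and the threshold is set at the log density of the least probable training vector so that the training error is zero. The two-dimensional vectors and the function names are hypothetical; the paper uses 30 dimensional phase locked complexes.

```python
import math

def fit_spherical_gaussian(vectors):
    """Mean vector and a single shared variance estimated from the data."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(col) / n for col in zip(*vectors)]
    var = sum(sum((v - m) ** 2 for v, m in zip(vec, mean)) for vec in vectors) / (n * d)
    return mean, var

def log_density(vec, mean, var):
    """Log of the spherical gaussian density at vec."""
    d = len(vec)
    sq = sum((v - m) ** 2 for v, m in zip(vec, mean))
    return -0.5 * sq / var - 0.5 * d * math.log(2 * math.pi * var)

def blind_threshold(train, mean, var):
    """Log density of the least probable training vector (zero training error)."""
    return min(log_density(v, mean, var) for v in train)

def is_nsr(vec, mean, var, threshold):
    return log_density(vec, mean, var) >= threshold

# Hypothetical NSR training vectors and one VT-like test vector.
nsr_train = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1], [0.05, 0.05]]
vt_test = [2.0, 2.0]

mean, var = fit_spherical_gaussian(nsr_train)
threshold = blind_threshold(nsr_train, mean, var)
vt_detected = not is_nsr(vt_test, mean, var, threshold)
```

Held-out NSR vectors that fall below the threshold give the false positive rate, and VT vectors that fall below it give the sensitivity.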

4  DATA  COMPRESSION  BOUNDS 

This section describes the bounds calculated for the data compression of the
QRS complex of the ICEG. Firstly, a generalisation of the model estimation
procedure used for the classifier bounds is described which uses learning
curves based on the estimated entropy of the data. The method is then applied
to the ICEG data base and yields bounds for lossless data compression.

4.1  Model  Estimation  for  Compression 

The goal of estimation is to determine a probability distribution of the form,
where x is given by Equation 4 and $G_j$ is the gaussian
probability distribution function given by,

$$G_j(x) = \frac{\exp\left(-\frac{1}{2}(x - \mu_j)^T C_j^{-1} (x - \mu_j)\right)}{(2\pi)^{m/2}\,|C_j|^{1/2}}$$

Figure 5 shows the procedure used to determine P(x).

Select a data sample and identify an initial model;
Initialise sample-size;
For sample-size < max-samples; sample-size *= 2; {
    For n-splits = 0; n-splits < total-splits; n-splits++; {
        Randomly split the sample in half
        into training and test sets;
        Use EM to fit model;
        Determine Htrain(M), Htest(M);
    }
}
Estimate H∞(M) and increase M;
Continue until M determined;

Figure 5: Procedure for determining M and P(x). The value of total-splits
used was 10. The sample size was varied over a range of 8.

When EM is used to fit the model it maximises the average log probability
per sample that the model generated the training data, given by
$L = \frac{1}{N}\sum_{i=1}^{N} \log_2 P(x_i)$ where N is the size of the
training set. Notice that L is closely related to the average information of
the training data with respect to the model. In fact $L = -H(M, N)$ where
H is an estimate of the entropy of the data source based on a sample size N
and a model of capacity M. Hence, given a model and training and test sets
both of size N, $H_{train}(M, N)$ and $H_{test}(M, N)$ can be calculated.
Analogous to the procedure of [Cortes et al., 1994] the behaviour of
$H_{test}$ and $H_{train}$ is considered as N is increased². In the limit as
$N \to \infty$, $H_{test} = H_{train} = H_\infty$. $H_\infty$ is then the
source entropy estimate for the vector x as measured by the model H(M). If
both $H_{test}$ and $H_{train}$ approach this limit at equal rates then an
unbiased estimate of $H_\infty$ is obtained via

$$H_\infty = \frac{H_{train}(M, N) + H_{test}(M, N)}{2} \qquad (6)$$

If the rates of convergence are unequal the procedure of [Cortes et al., 1994]
can be used which involves fitting power laws to the learning curves for H
over a range of N to deduce the unbiased estimate. Having obtained an
estimate of $H_\infty$, more complex models can be fitted by increasing M
until no significant reduction in $H_\infty$ is obtained. By this means, the
capacity of the model from a finite sample is obtained, while also
determining an upper bound on the entropy of the source.
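
The entropy estimate and the simple average of the training and test values can be sketched as below. The sketch uses a one-dimensional mixture for brevity and measures the average negative log2 density per sample; the mixture parameters and function names are hypothetical, and the mixture is assumed to have been fitted beforehand (for example by EM).

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def mixture_pdf(x, weights, means, variances):
    """Density of a 1D mixture of gaussians at x."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

def entropy_estimate(samples, weights, means, variances):
    """H = -(1/N) * sum of log2 P(x_i), in bits per sample."""
    total = sum(math.log2(mixture_pdf(x, weights, means, variances)) for x in samples)
    return -total / len(samples)

def h_infinity(h_train, h_test):
    """Average of the training and test estimates, valid when both approach
    the limit at equal rates."""
    return 0.5 * (h_train + h_test)

# Hypothetical fitted model and hypothetical training and test samples.
model = ([0.5, 0.5], [-2.0, 2.0], [0.25, 0.25])
h_train = entropy_estimate([-2.1, -1.9, 2.0, 2.2], *model)
h_test = entropy_estimate([-2.4, -1.6, 1.7, 2.5], *model)
h_inf = h_infinity(h_train, h_test)
```

Because the model is tuned to the training half, the training estimate tends to lie below the test estimate, and the average sits between the two.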

4.2  Results  for  the  ICEG 

In this section mixture of gaussian models are fitted to the ICEGs of
several patients who are candidates for ICD implantation using the procedures
described in Section 2. Single patients and a single rhythm type for that
patient are considered. Table 2 shows the variety of models required for
various patients and various rhythms. The $H_\infty$ shown is obtained using
the simple estimate of Equation 6. The conditioned probability model
$P(x|x_s)$ can be easily derived from the fitted model to give the time
series entropy estimate $H_\infty(x|x_s)$. The column $H_{lin}$ is included
to compare the result with the fitting of a single gaussian. The model
capacity M was determined conservatively by only taking higher values of M
when there is no evidence for $H_\infty$ increasing as determined by the
bounding standard deviation curves. Having obtained the distribution of the
data, the conditional entropy $H(x|x_s)$ can be determined, where $x_s$ are
the previous short and long term outputs of the source. $H_\infty(x|x_s)$
represents an upper bound on the true source entropy. The true source entropy
will only be reached if the source is finite memory and autoregressive and it
is stationary and ergodic. Of these conditions the ICEG is most likely to
fail on stationarity, as it is difficult to guarantee that all higher order
moments are stationary. Therefore, in practice only upper bounds on the
source entropy of the ICEG data can be achieved.

² Actually E[H] is calculated where the expectation is over all possible
choices of training and testing sets. In practice only the expectation using
a small number (10) of random choices is estimated.

Patient    Rhythm   M     N    $H_\infty$   $H_{lin}$
S000418    NSR      3    688      8.6          9.3
S000418    VT       4    703      9.4         10.6
045.vts    NSR      3    719      7.3          9.0
045.vts    VTR      3    719      8.0          9.5
55.vts     NSR      3    719      6.9          7.9
55.vts     VTR      2    719      8.4          9.4
S000518    VF       4    719     10.1         10.3
S000533    NSR      2    719      5.6          5.9
S000533    VF       3    719      8.1          8.5
S000513    NSR      3    719      5.6          6.5
S000513    VT       2    719      6.1          6.5
S000513    VF       3    719      6.6          7.5

Table 2: ICEG probability model capacity and $H_\infty$ estimates. The
entropy estimates are in units of bits per (4ms) sample.
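
The entropy estimates in Table 2 can be compared row by row with a short calculation of the relative reduction of the mixture estimate with respect to the single gaussian fit. The values are copied from Table 2; the variable and function names are hypothetical.

```python
# (patient, rhythm, H_inf, H_lin) in bits per sample, copied from Table 2.
table2 = [
    ("S000418", "NSR", 8.6, 9.3), ("S000418", "VT", 9.4, 10.6),
    ("045.vts", "NSR", 7.3, 9.0), ("045.vts", "VTR", 8.0, 9.5),
    ("55.vts", "NSR", 6.9, 7.9), ("55.vts", "VTR", 8.4, 9.4),
    ("S000518", "VF", 10.1, 10.3), ("S000533", "NSR", 5.6, 5.9),
    ("S000533", "VF", 8.1, 8.5), ("S000513", "NSR", 5.6, 6.5),
    ("S000513", "VT", 6.1, 6.5), ("S000513", "VF", 6.6, 7.5),
]

def relative_reduction(h_inf, h_lin):
    """Fractional reduction of the mixture estimate relative to a single gaussian."""
    return (h_lin - h_inf) / h_lin

reductions = [relative_reduction(h_inf, h_lin) for _, _, h_inf, h_lin in table2]
smallest, largest = min(reductions), max(reductions)
```

For these twelve rows the reduction ranges from about 2% (S000518, VF) to about 19% (045.vts, NSR).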

The three heart rhythms NSR, VT and VF correspond to three therapeutic
groupings of the ICD: no therapy, pacing and defibrillation, respectively. The
morphologies are usually noticeably different for the three rhythms. NSR and
VT are quite periodic, whereas VF is quite variable in both period and
amplitude. Table 2 shows that between 2 and 4 gaussians are sufficient to model
the data. The reduction in $H_\infty$ is both patient and rhythm dependent, with
between 2% and 20% reduction compared to the fit of a single gaussian. Over
the entire database of 146 patients, fitting a mixture model reduced the
entropy estimate by 15% for the NSR class, 11% for the VT class and 8.5% for
the VF class.

5  CONCLUSION


Mixture of gaussians models were chosen due to their clear interpretation
and their utility for analysing both data compression and classification
problems. Amplitude and phase locking was identified as an important common
preprocessing step for both tasks. Classifiers were modelled using a single
gaussian per class, and learning curves were used to determine misclassification
bounds. Two tasks were considered for 5 patients: the separation of NSR from
VT 1:1 with training sets of both, and blind separation. Bounds close to 100%
correct classification were indicated for both classification tasks. Data
compression bounds were obtained using mixture of gaussians models fitted using
the EM algorithm. The capacity of the models was determined by using a
generalisation of the learning curve procedure using the entropy estimates.
Experiments on the ICEG data showed that fewer than 5 gaussians were
required for the modelling of the data and that the resulting reduction in
entropy estimate over the single gaussian case was 8% to 15%, depending on the
rhythm class, averaged over a database of approximately 150 patients. This
is to be contrasted with the previous, simpler Gauss-Markov modelling of the
ECG; it nevertheless indicates the adequacy of linear coding approaches.

Abstract. Automation of anaesthesia is a complex task, but it is important
with respect to patient health, improved quality of narcosis and
cost reduction. Furthermore, it will enhance our understanding of the
complex mechanism underlying anaesthesia.

Classical model-based control concepts have been evaluated in the
past. Those approaches were limited to univariate process control.
The objective of our studies is to establish the feasibility of
different real-valued reinforcement learning approaches for the task
of multivariate adaptive control in anaesthesia.

As a first step we present a series of experiments with a naive application
of reinforcement learning. Its appropriateness is demonstrated
in the univariate case. Results are compared to a model-based analytical
controller.


1  Introduction 

Automatic control in anaesthesia is a powerful tool to improve narcosis with
respect to:

1. patient specific supply of anaesthetic agents, which decreases the necessary
concentration levels to a minimum,

2. support for the anaesthetist in case of temporally delayed effects or
nonlinear combination of effects,

3. development of a theoretical foundation of anaesthesia,

4. cost reduction.

Automatic model-based control of volatile anaesthetics exploiting the median
EEG frequency (MEF) was successfully applied in clinical trials [Sch95].
This particular approach was univariate, i.e. the MEF was used as input to
the controller (state characterization) to adjust the vaporizer setting
(CVap) as effector^. An explicit invertible model is required that captures
the dependence between control parameters and effects. Due to mathematical
restrictions this approach cannot be generalized to more than one narcotic
agent. Nor can multiple parameters of the patient be included.

^ The vaporizer setting influences the concentration of the anaesthetic agent in the
inhaled gas, thus controlling the patient's anaesthetic depth. Usually the EEG median
frequency (MEF) is monitored as an indicator of anaesthetic depth. An MEF of
10 Hz characterizes a person who is awake, while a narcotized person has
an MEF of less than 3-4 Hz.


0-7803-4256-9/97/$10.00 ©1997 IEEE

Reinforcement based learning [Sut88, Tes92, KLM96] has recently matured
into a technique applicable to multidimensional real world tasks. In our
project we explore the appropriateness of reinforcement learning systems for
automation in anaesthesia.

2  Methods 

2.1  Analytic  model  based  closed  loop  control 

Every approach to designing a controller requires knowledge about the system
to be controlled. Usually this knowledge is represented by an explicit model
of the plant. By restricting the number of parameters to one input and one
output variable, a narcotized patient can approximately be modelled by the
CVap/MEF relation.

The applicability of a model of five first-order compartments was
demonstrated in [Sch95]. It was inverted using the Laplace transform, so that a
controller for the system could be derived.

To fit the patient's characteristics the model's parameters must be adapted
during control. Quality of control relies on these parameter settings. The
control scheme is depicted in Fig. 1(a).


Fig. 1. A schematic overview of the applied controllers. (a) Model-based controller
with analytically derived inverse model. (b) Q-learning controller for heuristic
adaptive direct control. S denotes sensor evaluation and R reward determination.

Even if this procedure leads to acceptable results in the univariate case, it
has substantial restrictions. The use of an explicit invertible model hinders
the application of more complex and particularly multivariate models.
A control scheme that does not rely on explicit modelling might overcome
these limitations.


2.2  Reinforcement  learning 

Reinforcement learning is a method for heuristic adaptive direct control
[Sut88]. Fig. 1(b) illustrates Q-learning [Wat89], an instance of reinforcement
learning. During learning an evaluation function (Q-function) is approximated
that represents the quality of a state-action pair with respect to the
task. For training, the system interacts with its environment, which can be
either a real patient or a model of any complexity. Actions (vaporizer settings)
to be applied in a certain state are selected probabilistically according to the
Boltzmann distribution based on their Q-values.
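
A minimal sketch of Boltzmann action selection, assuming the usual form in which action a is chosen with probability proportional to exp(Q(s, a) / T). The temperature parameter T, its value and the function names are illustrative assumptions; the text does not state how the temperature was set or scheduled.

```python
import math
import random


def boltzmann_probabilities(q_values, temperature):
    """Selection probabilities proportional to exp(Q / temperature)."""
    q_max = max(q_values)  # subtracting the maximum keeps exp() numerically safe
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(q_values, temperature, rng=random):
    """Draw one action index according to the Boltzmann probabilities."""
    probs = boltzmann_probabilities(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


# Hypothetical Q-values for three vaporizer settings in one state.
q_row = [1.0, 2.0, 3.0]
warm = boltzmann_probabilities(q_row, temperature=1.0)   # explores all settings
cold = boltzmann_probabilities(q_row, temperature=0.01)  # almost greedy
chosen = select_action(q_row, temperature=1.0)
```

A high temperature spreads probability over all settings (exploration), while a low temperature approaches the deterministic choice of the setting with maximal Q-value.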

Thus  control  does  not  depend  on  an  invertible  model,  or  even  any  explicit 
model  at  all.  It  therefore  has  the  potential  of  controlling  multivariate  systems 
which  are  difficult  to  model  or  impossible  to  invert.  In  principle  reinforcement 
learning  may  overcome  the  limitations  of  the  analytic  approach. 

A major drawback of reinforcement learning that has to be mentioned
here is that it places strong requirements on the controlled system. The
convergence of Q-learning to an optimal control policy [WD92] is proven if:

1. Q-values are stored by table lookup

2. the controlled system is a Markov decision process

3. every state-action pair is evaluated arbitrarily often

4. appropriate learning parameters are selected

While 3) and 4) are usually satisfiable, 1) and 2) are often violated by real
world tasks. If a task is real-valued, the Q-function cannot be stored in lookup
tables. Discretization can be used to solve this problem provided that
appropriate intervals are chosen. Function approximators such as artificial neural
networks can also be used to circumvent the difficulty of choosing a good
discretization and to speed up learning by generalization. Both approaches
will be used in our experiments.
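
For the table-lookup case, the standard one-step Q-learning update can be sketched as follows. The learning rate alpha, the discount factor gamma, the reward value and the state and action indices are placeholder assumptions, since the text does not report the learning parameters that were used.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One Q-learning step: move Q(s, a) toward reward + gamma * max_a' Q(s', a')."""
    target = reward + gamma * max(q_table[next_state])
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]


# Table indexed by discretized state (rows) and discretized action (columns).
n_states, n_actions = 50, 20  # illustrative sizes
q_table = [[0.0] * n_actions for _ in range(n_states)]
q_update(q_table, state=12, action=3, reward=1.0, next_state=13)
```

Each entry is updated only when its state-action pair is actually visited, which is why condition 3) requires every pair to be evaluated repeatedly.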

The problem of non-Markovian processes is eased by using a finite history
of state-action information. Unfortunately, every increase in dimensionality
of the state multiplies the time to convergence if table
lookup is used. Additional information for the decision problem has to be
traded off against the time affordable to explore the exponentially growing state
space [TM92]. If the extended state space is of uniform structure, function
approximators may advantageously be applied.

Despite the obvious utility of function approximation in value function
representation, convergence of the learning process can no longer be guaranteed.
The convergence properties of Q-learning [Ast95] strongly depend on the
representation of the Q-function and policy, and even if learning converges
the final policy might not be optimal [Heg96].

3  Experimental setup


The objective of our learning studies is to develop and to optimize a control
policy that stabilizes the MEF at a value of 2.5 Hz. For training, a number
of narcosis sessions are simulated on the basis of the five-compartment
model described by [Sch95]. Each session consists of 240 minutes of
simulated narcosis time.

Uniform noise is added to the MEF to keep the problem as realistic as
possible. In addition, artefacts are simulated to approximate the clinical
situation.

As a reference point in our evaluation, the analytic controller (i.e. the
inverse of the model) was used to control the model. Fig. 2 shows an example
of such optimal control. Please note that no adaptation of model parameters
is necessary and the best actions can be determined
analytically. In contrast to later experiments, regulation is performed every 5
minutes (0.0033 Hz).


Fig. 2. The control behavior of the model-based controller. Control is considered
successful if the MEF stays between 2.1 and 2.9. Parameters were selected in such a
way that no adjustments in the controller-inherent model were necessary.


The reinforcement controller is trained online, i.e. while it controls the
model, knowledge is acquired in the Q-approximation and a control policy
develops. Regulation is performed every minute (1/60 Hz). Training is
continued as long as a significant improvement is observed.

Both the model with noise and the model without noise are investigated for training.

Unfortunately the current MEF alone does not include enough state
information to enable proper learning. Therefore a finite history of these values
is provided to the system. Because the state space grows exponentially with
history size, the history is limited to a single value, the previous MEF. The
next section shows that this is sufficient to acquire reasonable control policies.

In first experiments table lookup is used to represent the Q-function. The
necessary discretization of the MEF and the CVap settings is realized by
50 and 20 intervals, respectively. Despite this coarse-grained discretization,
reasonable control strategies can be developed. A higher resolution may lead
to even better results. Different function approximators are evaluated in
current experiments.
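
To make the cost of the history concrete, the sketch below encodes a state from the current and the previous MEF and counts the resulting table entries. The 50 and 20 intervals come from the text; the MEF range of 0 to 10 Hz, the uniform interval widths and the function names are assumptions for illustration only.

```python
MEF_BINS = 50    # intervals for the MEF, as stated in the text
CVAP_BINS = 20   # intervals for the vaporizer setting, as stated in the text


def to_bin(value, low, high, n_bins):
    """Index of the interval containing value, clamped to 0..n_bins-1."""
    if value <= low:
        return 0
    if value >= high:
        return n_bins - 1
    return int((value - low) / (high - low) * n_bins)


def encode_state(mef_now, mef_prev, mef_range=(0.0, 10.0)):
    """Combine the current and previous MEF bins into one table row index."""
    low, high = mef_range  # assumed range in Hz, not given in the paper
    return (to_bin(mef_now, low, high, MEF_BINS) * MEF_BINS
            + to_bin(mef_prev, low, high, MEF_BINS))


def table_entries(history):
    """Number of Q-table entries when `history` past MEF values extend the state."""
    return MEF_BINS ** (history + 1) * CVAP_BINS


row = encode_state(mef_now=2.5, mef_prev=3.1)
print(table_entries(0), table_entries(1), table_entries(2))
```

Under these assumptions each additional MEF value in the history multiplies the table size by 50, which is the growth that motivates limiting the history to one value.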

4  Results 

After some training sessions the current policy is evaluated deterministically,
i.e. by selecting vaporizer settings with maximal estimated utility (Q-value).
A sample run is plotted in Fig. 3. It shows behavior similar to that of the
optimal analytic controller, although some differences still become apparent
in Tab. 1.


Fig. 3. The control behavior of a reinforcement learning system with a history of
one state: (a) using table look-up, (b) using a real-valued approximator.

To visualize a learned control policy, noise as well as artifact simulation
are turned off (cf. Fig. 4(a)). Now a deterministic system has to be controlled
and the pure control strategy can be examined.

Obviously the learned policy is not able to keep the MEF at a stable
level. This is not a fundamental weakness of the method but is caused by the
discretization of the input and output signals. The vaporizer cannot be
adjusted precisely because the desired settings usually fall inside
discretization intervals. Real-valued Q-learning algorithms, as shown in Fig. 3(b),
overcome this problem.

Policies, however, do not converge toward a unique stable strategy.
Interestingly, a variety of strategies can be observed during learning, such as
the oscillating one depicted in Fig. 4(b).


Fig. 4. (a) The same control policy as depicted in Fig. 3(a) but without noise and
artifacts added to the model. (b) The learned oscillating control policy.


To evaluate learned policies four different features are considered: a) the
time until an MEF of 2.5 was reached (initial phase of anaesthesia), b) the
mean value of the MEF during anaesthesia (steady state), c) the
standard deviation of the MEF, i.e. smoothness of control, and d) the average
dose of the anaesthetic applied to the patient. Those values were compared
to the corresponding optimal values of the analytic controller and to those
of a random controller.
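
The four features can be computed from a recorded run roughly as sketched below. Several details are assumptions, because the text does not define them precisely: the target counts as reached at the first sample at or below 2.5, the steady state is taken to be everything from that sample onward, the standard deviation is the population form, and the traces are hypothetical lists with one value per regulation step.

```python
import math


def evaluate_run(mef_trace, dose_trace, minutes_per_step=1.0, target=2.5):
    """Return (time to target in minutes, steady-state mean, steady-state std, mean dose)."""
    # Assumption: the run starts above the target, which counts as reached at the
    # first sample at or below it; the steady state is everything from there on.
    reached = next((i for i, mef in enumerate(mef_trace) if mef <= target), None)
    mean_dose = sum(dose_trace) / len(dose_trace)
    if reached is None:
        return None, None, None, mean_dose
    steady = mef_trace[reached:]
    mean = sum(steady) / len(steady)
    std = math.sqrt(sum((mef - mean) ** 2 for mef in steady) / len(steady))
    return reached * minutes_per_step, mean, std, mean_dose


# Hypothetical six-step run: MEF in Hz, vaporizer setting in arbitrary units.
time_to_target, mean_mef, std_mef, mean_dose = evaluate_run(
    [10.0, 6.0, 3.0, 2.5, 2.4, 2.6],
    [1.0, 1.0, 1.0, 0.5, 0.5, 0.5],
)
```

A run that never reaches the target returns None for the first three features, which loosely corresponds to a controller for which no time to 2.5 can be reported.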

The comparison of mean value, standard deviation, time to 2.5 and
dose shows that the reinforcement learning controller based on table look-up
was not able to reach the peak performance of the analytic controller, but
that its performance is by far better than that of a reinforcement learning
controller without history or of random control. A first experiment with a
real-valued approach shows promising results, which are summarized in the
following table.

Method          Time to 2.5    Mean Value   Std. Deviat.   Avg. Dose
Model based     5 min 36 sec   2.49         0.40           0.724
RL real val.    7 min 40 sec   2.51         0.40           0.738
RL w. hist. 1   9 min 40 sec   2.34         0.54           0.742
RL wo. hist.    22 min         1.67         1.07           0.809
Random          -              0.77         1.88           1.224

Table 1. A comparison of some control techniques: i) analytical model-based
control, ii) reinforcement learning with a real-valued representation, iii) reinforcement
learning with a one step history using table look-up, iv) reinforcement learning
without history, v) a random controller.


5  Conclusions and future work

Control by the Q-learning system is, even when using table look-up as
representational formalism, acceptably close to the analytic solution within a
few iterations. In the current experimental setting the performance of the analytic
controller represents an upper limit. Its optimality, though, relies on the
appropriateness of the invertible model, which has to capture the characteristics
of the specific patient to be narcotized. As soon as the model structure does not fit
the real characteristics, e.g. in case of a real patient, learning controllers may
perform better, because they are not limited by the model structure.

This paper demonstrated that even plain Q-learning is
capable of performing near-optimal control of univariate anaesthesia. Noise
and artifact simulation do not prevent Q-learning from converging. Training
is slightly less efficient than with a purely deterministic model, but convergence
is more robust and the resulting strategies are more stable.

Still, a number of problems have to be solved before
moving to the more challenging multivariate case.

1. Real-valued approaches need to be explored more intensively. Currently a
variety of approximators are under investigation, including artificial neural
network approaches (backpropagation, radial basis function networks,
feature maps and extensions to those), memory-based techniques and
mathematical interpolation methods.

2. The applied state description does not fully capture the model's state.
The process to be controlled does not possess the desired Markov property,
so a problem of instability of the solution arises. First results with
the integration of more MEF values and of a past vaporizer setting are
promising.

3. Finally, a more sophisticated reward function has to be used that e.g.
penalizes consumption of anaesthetics. It should also consider aspects such
as the smoothness of vaporizer settings.

ABSTRACT

Recent  advances  in  wavelet  theory  are  affording  great  opportunities  for 
signal  processing  applications.  Natural  neuronal  networks  exhibit  wavelet 
behavior  from  which  structural  and  functional  paradigms  could  be  exploited 
for  machine-vision  applications.  Provided  here  is  a  summary  of  the  ways 
vertebrate  vision  systems  naturally  exhibit  wavelet  characteristics. 


INTRODUCTION 

Wavelets  can  be  generally  defined  as  little  waves  that  start  and  stop  and  originate 
from  a  single  basic  function  [18].  The  Wavelet  Transform  decomposes  signals 
into  their  wavelet  components  as  the  Fourier  Transform  decomposes  signals  into 
their  frequency  components.  Fourier  time-frequency  representations  are 
presented  here  as  an  introduction  to  the  more  general  wavelet  representations. 

Early vision can be defined as the processes that recover the properties
of object surfaces from 2D intensity arrays [6]. The structure and function of
natural vision systems exhibit wavelet characteristics in many ways. The focus
here is on vertebrate vision information pathways that begin in the retina and
terminate in cortical processing stages. Many of these concepts are also common
in insect vision.


CONVENTIONAL  TIME-FREQUENCY  REPRESENTATIONS 

Fourier developed a series of weighted sine and cosine terms to represent a
periodic waveform $f(t)$ with period $T = 2\pi/\omega_0$, where $\omega_0$ is the
fundamental radian frequency. The Fourier Series is an infinite sum of components,
weighted by coefficients $a_n$, at integer, $n$, multiples of $\omega_0$ [15,20]:

$$f(t) = \sum_{n=-\infty}^{\infty} a_n\, e^{i n \omega_0 t} \qquad (1)$$

As $T$ increases, $\omega_0$ decreases along with the frequency spacing between Fourier
Series components (or terms). As $T \to \infty$, the frequency distance between
components becomes infinitesimal so that $n\omega_0 \to \omega$, a continuous variable. The
solution for the coefficients, when integrated over one sample period, becomes the
Fourier Transform:

$$(\mathcal{F}f)(\omega) = \int dt\, f(t)\, e^{-i\omega t} = \langle f(t), e^{i\omega t} \rangle, \qquad (2)$$

where $\langle \cdot , \cdot \rangle$ denotes the inner product.

For real applications, functions of time are analyzed in finite intervals.
To transform $f(t)$ into its frequency components, the signal is assumed periodic in
the sampling interval. The duration of this interval becomes the fundamental
period $T$, which also defines the frequency resolution $\Delta\omega = \omega_0$. For long
sampling intervals (large $T$), frequency details are well resolved (small $\Delta\omega$), but
the location in time of specific events within $T$ is not known. Compensating for
this by reducing $T$ around a specific time $t_0$ results in large $\Delta\omega$, hence an
uncertainty of specific frequency components. This tradeoff between time and
frequency resolution is known as the time-frequency uncertainty.
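
A short worked example may make the tradeoff concrete. It uses only the relation between the analysis interval and the frequency spacing stated above; the two interval lengths are arbitrary illustrative choices.

```latex
\Delta\omega = \omega_0 = \frac{2\pi}{T}
\quad\Longrightarrow\quad
T\,\Delta\omega = 2\pi ,
\qquad
\begin{aligned}
T = 1\ \text{s}   &: \ \Delta\omega = 2\pi\ \text{rad/s} \ (\Delta f = 1\ \text{Hz}),\\
T = 0.1\ \text{s} &: \ \Delta\omega = 20\pi\ \text{rad/s} \ (\Delta f = 10\ \text{Hz}).
\end{aligned}
```

Shortening the interval by a factor of ten localizes events ten times better in time but coarsens the frequency spacing by the same factor, since the product of the two stays fixed.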

An alternative to a specified duration of $f(t)$ is to window $f(t)$ with a
Gaussian-like function, $g(t)$, centered at time $t_0$. This windowing operator, $T^{win}$,
serves to weight the signal closest to time $t_0$ more heavily than the signal farther
away in time while maintaining a reasonably large $T$ for adequate frequency
resolution. In addition, the decay of $g(t)$ reduces the undesirable effects of the Gibbs
phenomenon at the edges [15]. The result is a Fourier transform that is a
function of $t_0$, called the Windowed Fourier Transform:

$$(T^{win} f)(\omega, t_0) = \int dt\, f(t)\, g(t - t_0)\, e^{-i\omega t} = \langle f(t), g(t - t_0)\, e^{i\omega t} \rangle \qquad (3)$$

The Discrete Windowed Fourier Transform results in integer $m$ sampled
spectral components ($m\omega_0$) derived from integer $n$ sampled time intervals ($n t_0$) of
a signal. It is given as [7]

$$T^{win}_{m,n}(f) = \int dt\, f(t)\, g(t - n t_0)\, e^{-i m \omega_0 t} = \langle f(t), G_{m,n}(t) \rangle, \qquad (4)$$

where $G_{m,n}(t) = g(t - n t_0)\, e^{i m \omega_0 t}$.

WAVELET  TIME-FREQUENCY  REPRESENTATIONS

Given an appropriate function space, the set of functions generated by recursive
applications of the scaling function to the mother wavelet constitutes a set of
basis functions called a wavelet series [4]. The Continuous Wavelet Transform
(CWT) is similar to the Fourier Transform (or Windowed Fourier Transform) in
that both involve an inner product of an input function with a scaling function.
The CWT is given as [7]

$$(T^{wav} f)(a, b) = |a|^{-1/2} \int dt\, f(t)\, \psi^{*}\!\left(\frac{t-b}{a}\right) = |a|^{-1/2} \left\langle f(t), \psi\!\left(\frac{t-b}{a}\right) \right\rangle. \qquad (5)$$

Variable $a$, the dilation parameter, controls the interval of compact support, or
duration, for the wavelet $\psi^{a,b}(t) = |a|^{-1/2}\, \psi((t-b)/a)$. Variable $b$, the translation
parameter, determines where in time the wavelet exists. The Discrete Wavelet
Transform (DWT) is a collection of samples of the CWT, sampled at $m$
incremental dilations and $n$ incremental translations, given as [7]

$$T^{wav}_{m,n}(f) = |a_0|^{-m/2} \int dt\, f(t)\, \psi^{*}(a_0^{-m} t - n b_0) = |a_0|^{-m/2} \langle f(t), \psi(a_0^{-m} t - n b_0) \rangle \qquad (6)$$

For a given application, the optimal basis functions, or wavelet series
components, must be chosen as well as the optimal scaling and translation
parameters. These selections make the application of wavelet transforms more
complicated than that of Fourier transforms.
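
The roles of the dilation and translation parameters can be illustrated with a small numerical sketch. It assumes a real "Mexican hat" mother wavelet, which is one common choice and is not prescribed by the text, and approximates the CWT integral by a Riemann sum; all names and parameter values are illustrative.

```python
import math


def mexican_hat(t):
    """A real mother wavelet (second derivative of a Gaussian, up to sign and scale)."""
    return (1.0 - t * t) * math.exp(-0.5 * t * t)


def dilated_wavelet(t, a, b, mother=mexican_hat):
    """Dilated and translated wavelet |a|^(-1/2) * psi((t - b) / a)."""
    return abs(a) ** -0.5 * mother((t - b) / a)


def cwt_coefficient(samples, dt, a, b, mother=mexican_hat):
    """Riemann-sum estimate of the CWT at (a, b) for a real wavelet, samples at t = k*dt."""
    return dt * sum(f_k * dilated_wavelet(k * dt, a, b, mother)
                    for k, f_k in enumerate(samples))


def energy(a, b, dt=0.01, half_span=40.0):
    """Numerical integral of the squared wavelet, to check the |a|^(-1/2) normalisation."""
    steps = round(2 * half_span / dt)
    return dt * sum(dilated_wavelet(b - half_span + k * dt, a, b) ** 2
                    for k in range(steps + 1))


# A Gaussian bump at t = 5 s, sampled every 10 ms for 10 s.
dt = 0.01
bump = [math.exp(-0.5 * (k * dt - 5.0) ** 2) for k in range(1001)]
aligned = cwt_coefficient(bump, dt, a=1.0, b=5.0)   # wavelet centred on the bump
shifted = cwt_coefficient(bump, dt, a=1.0, b=8.0)   # wavelet translated away from it
```

The energy check shows why the factor |a|^(-1/2) appears: it keeps the wavelet's energy the same at every dilation, so coefficients at different scales are comparable. The two coefficients show that the translation b selects where in time the signal is examined.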

Consider a DWT implemented with unity dilation, $a_0 = 1$, and a mother
wavelet that is a windowed complex exponential, $\psi_{m,n}(t) = g(t - n t_0)\, e^{i m \omega_0 t}$; in this
case the DWT becomes the DFT. Using unity dilation defeats the utility of
allowing for a better balance between frequency resolution at lower frequency
components and time resolution at higher frequency components. However, this
demonstrates that the Fourier Transform can be thought of as a special case of
the Wavelet Transform.


STRUCTURE  AND  FUNCTION  OF  EARLY  VISION 

Natural vision filtering begins with photonic refraction through the cornea and
lens. The incoming light then passes through the vitreous humor and retinal cell
tissue, and is focused onto a photoreceptor mosaic surface. Photonic energy is
converted to electronic charge in the photopigment discs of the photoreceptors
(rods and cones). The photoreceptors, with the help of a layer of horizontal cells,
spread the charge in space and time within a local neighborhood of other
receptors. The spread charge and photoreceptor charge are both available at the
root of the photoreceptor, at the triad synapse. The bipolar cells connect to triad
synapses and presumably activate signals proportional to the difference between
the photoreceptor input and the horizontal cell input. Both polarities exist, called
on-bipolars or off-bipolars, which respond to light and darkness respectively

[11.14.2?;i.  .  .  ^ 

The photoreceptor charge is influenced by gap junctions between
adjacent photoreceptors. The response from a photoreceptor aggregate can be
modeled as a spatial-temporal Gaussian with a small variance. The input from
the neighboring aggregate of horizontal cells can be modeled with a similar
Gaussian with a larger variance. The differencing function results in the
difference-of-Gaussian (DOG) filter operation, resulting in a center-surround
antagonistic receptive field profile. DOG and Laplacian-of-Gaussian (LOG)
functions have been used to model the bipolar cell output [14].
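
A one-dimensional sketch of the difference-of-Gaussian operation shows the center-surround behavior described above. The two standard deviations, the kernel radius and the edge handling are illustrative assumptions; the text does not give numerical values for the photoreceptor or horizontal cell spreads.

```python
import math


def gaussian(x, sigma):
    """Unit-area Gaussian with standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def dog_kernel(radius, sigma_center, sigma_surround):
    """Narrow (center) Gaussian minus wide (surround) Gaussian at offsets -radius..radius."""
    return [gaussian(x, sigma_center) - gaussian(x, sigma_surround)
            for x in range(-radius, radius + 1)]


def filter_same(signal, kernel):
    """Apply a symmetric kernel to a signal, replicating the edge samples."""
    radius = len(kernel) // 2
    last = len(signal) - 1
    return [sum(signal[min(max(i + j - radius, 0), last)] * k for j, k in enumerate(kernel))
            for i in range(len(signal))]


kernel = dog_kernel(radius=12, sigma_center=1.0, sigma_surround=3.0)
flat_response = filter_same([5.0] * 40, kernel)              # uniform illumination
edge_response = filter_same([0.0] * 20 + [1.0] * 20, kernel)  # dark-to-light edge
```

Because the excitatory center and the inhibitory surround nearly cancel, uniform illumination produces almost no output, while an intensity edge produces responses of opposite sign on its two sides.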

The analog charge information in the retina is funneled into information
pathways as it is channeled from the mosaic plane to the optic nerve [3]. The
channels that exist there are the rod channel, initiated by rod bipolars, the
parvocellular pathway (PP) and the magnocellular pathway (MP), the latter two
initiated by cone bipolars [10]. Both the PP and the MP exhibit center-surround
antagonistic receptive fields. PP cones are tightly connected, responding to small
receptive fields, while the MP cones are more loosely connected (together with
rod inputs), responding to large receptive fields.

The MP and PP perform separate spatial band-pass filtering, provide
color and intensity information, and also provide temporal response channels, as
illustrated in Figure 1. A relatively high degree of acuity is achieved in each
domain from these few filters. The MP is sensitive to low spatial frequencies and
broad color intensities, which provide basic information of the objects in the
image. The PP is known to be sensitive to higher spatial frequencies and
chromatic differences, which add detail and resolution [17]. In the color domain,
the PP provides color opponency and thus spectral specificity, and the MP
provides color non-opponency and thus overall intensity [9,12]. In the time
domain, the PP provides slowly varying dynamics, while the MP provides
transient responses to image dynamics.


[Figure 1 diagram: Input Imagery feeds two channels. Parvocellular Pathway (PP):
Local Spatial Detail, Local Slow Dynamics, Local Color. Magnocellular Pathway (MP):
Local Spatial Average, Local Temporal Transients, Local Intensity.]

Figure 1. Natural Vision Information Channels.
The color opponent PP responds to spatial detail, slowly-varying image
dynamics, and chromatic detail. The color non-opponent MP responds to
spatial averages, rapid transients, and intensity variations.

WAVELET  FEATURES  INHERENT  IN  VISION  PROCESSING


Wavelet bases can be subdivided into orthogonal or nonorthogonal and
complete or noncomplete categories [21]. A set of basis functions is orthogonal if
the inner product of any two different basis functions is zero, and complete if no
non-zero function in the space is orthogonal to every basis function [4,22].
Orthogonality and completeness ensure a unique transformed representation of
each function within the given space. These are the general requirements for the
selection of wavelet bases in compression applications.

However, biological systems are not concerned with information storage
for perfect reconstruction. Any machine-vision application requiring some action
to be taken based on an understanding of the image content will also fit this
general description. In fact, many biological processing functions are considered
to be non-orthogonal [8]. The task is processing information in order to take
some action, not processing information for later reconstruction. The
redundancy of vision filters is balanced by the need for efficiency, simplicity, and
robustness. Information redundancy results in unnecessary hardware and
interconnections, but often redundancy may be required to sufficiently span the
information space inherent in the environment. The cost of supporting the
redundancy may be less significant than the benefit of using simpler processing
elements that degrade gracefully.

Wavelet  Filter  Banks  and  Vision  Pathways 

The MP and PP decompose the natural input image into local average
and local detail components, respectively. Images can also be decomposed into
wavelet components using quadrature mirror filtering (QMF), resulting in a
series of averaging components and another series of detailing components
[7,19]. QMF is a special case of subband coding, where filtered components
represent the lower and upper frequency halves of the original signal bandwidth.
If the analyzing filter coefficients are symmetric, then the synthesizing
components are mirrored with respect to the half-band value, thus the term
quadrature mirror. A variety of applications have emerged from the remarkable
QMF reconstruction capabilities [13,16,18].
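
The split into averaging and detailing components, and the reconstruction property, can be sketched with the two-tap Haar pair, which is the simplest quadrature mirror filter bank. The choice of Haar filters is an assumption made for brevity; the text does not say which filters are meant, and practical QMF designs use longer filters.

```python
import math

SCALE = 1.0 / math.sqrt(2.0)  # keeps signal energy unchanged across the split


def haar_analysis(signal):
    """Split an even-length signal into average (low-pass) and detail (high-pass) halves."""
    pairs = range(len(signal) // 2)
    averages = [(signal[2 * i] + signal[2 * i + 1]) * SCALE for i in pairs]
    details = [(signal[2 * i] - signal[2 * i + 1]) * SCALE for i in pairs]
    return averages, details


def haar_synthesis(averages, details):
    """Rebuild the original signal from its average and detail components."""
    signal = []
    for avg, det in zip(averages, details):
        signal.append((avg + det) * SCALE)
        signal.append((avg - det) * SCALE)
    return signal


row = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]  # hypothetical image row
averages, details = haar_analysis(row)
rebuilt = haar_synthesis(averages, details)
```

Applying the analysis step again to the averages gives the next coarser level, which is how a series of averaging and detailing components arises from repeated filtering.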

Vision pathways (MP and PP) and QMF filter banks therefore both
break up the input image signal into high and low frequency components.
Recent spectral analysis of Gaussian-derived vision models indicates a remarkable
retention of information in spite of the nonorthogonal nature of vision filters
[2,3].

Examples of Waveform Translation and Dilation in Vision Systems

As described previously, a wavelet basis is generated by recursively applying a
scaling function to a mother wavelet [4,7]. The essential characteristics of the
wavelet scaling function are the translation and dilation parameters of (5) and
(6). Structural and functional subunits of vision systems also tend to replicate a
mother function with scaling and translational variations. These variations are
functions of other parameters such as time, gaze direction, and eccentricity, the
distance from the center of the retina.

Photonic spreading is a function of the finite aperture, optical
imperfections, and photonic wavelength [10]. A Gaussian mother wavelet of
carefully chosen variance can model the point spread function at the
retina center (zero eccentricity) at a selected visible frequency. The complete
point spread function can be modeled as a two-dimensional scaling of this mother
function with respect to both eccentricity and photonic frequency.

The photoreceptor mosaic includes rods and three cone types, called L,
M, and S for “long”, “medium”, and “short” visible wavelength peaks in their
spectral absorption curves. Sampling in the center of the retina is limited to L
and M cells, providing a significantly higher spatial resolution as compared to
the periphery [10,11]. For a given viewing position, a mental model of an
environment is built from a set of high resolution samples from different
directions. These translations of the mosaic function are the result of eye
movements [5].

Due to scaling with frequency, higher frequency wavelet components
typically have smaller regions of support. The sampling density of M and L cone
cells is greatly reduced as a function of increasing eccentricity. This is due to
increasing cell sizes and also to the introduction of rod cells and S cells in the
outer regions. A cellular receptive field function representing the envelope of
available photonic energy stimulating either M or L cones will thus include an
increasing scale factor with eccentricity.

The initial electronic processing stages of early vision are characterized
by DOG filtering operations of the MP and PP. The spatial extent of both the MP
and PP receptive fields increases significantly with eccentricity [10]. A DOG
mother function can be scaled with respect to eccentricity to properly model
vision pathways at different parts of the retina.
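A minimal sketch of this scaling follows, assuming a radially symmetric difference-of-Gaussians mother function and a linear growth of receptive-field size with eccentricity. The two standard deviations, the slope, and all function names are hypothetical values chosen for illustration and are not taken from the text or from [10].

```python
import numpy as np

def dog(r, sigma_center=1.0, sigma_surround=3.0):
    """Radially symmetric 2-D difference-of-Gaussians mother function."""
    def gauss(radius, s):
        return np.exp(-radius**2 / (2.0 * s**2)) / (2.0 * np.pi * s**2)
    return gauss(r, sigma_center) - gauss(r, sigma_surround)

def zero_crossing_radius(sigma_center=1.0, sigma_surround=3.0):
    """Radius at which the center and surround Gaussians cancel."""
    num = 2.0 * np.log(sigma_surround**2 / sigma_center**2)
    den = 1.0 / sigma_center**2 - 1.0 / sigma_surround**2
    return np.sqrt(num / den)

def scale_factor(eccentricity_deg, slope=0.5):
    # Hypothetical linear growth of receptive-field size with eccentricity.
    return 1.0 + slope * eccentricity_deg

def dog_at(r, eccentricity_deg):
    """Mother function dilated for a given eccentricity."""
    a = scale_factor(eccentricity_deg)
    # Dividing by a**2 keeps the 2-D integral of the dilated function unchanged.
    return dog(r / a) / a**2

r0 = zero_crossing_radius()
a10 = scale_factor(10.0)
```

Under these assumptions the center-surround boundary of the dilated function sits at `a10 * r0`, so a single mother function describes receptive fields at every eccentricity once the scale factor is known.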

Efficient  Use  of  Basis  Functions 

A common theme among space, time, and color domains is the minimal
use of basis functions. There are essentially four chromatic detector types, three
temporal channels, and three spatial channels. Therefore, natural neuronal
systems tend to use efficient combinations of only a few filters to accomplish a
high degree of acuity in each domain. QMF signal reconstruction capability is a
practical demonstration of extracting spectral detail from only two filters. The
behavior of such synthetic applications may lead to a deeper understanding of
natural phenomena.

A large proportion of today's commercial Optical Character Recognition systems
(OCR) and Handwriting Recognition Systems (HWR) use neural networks at the
core of the recognition engine. Comparisons on standard databases show that
Neural Networks, particularly multi-layer networks, offer a good combination of
speed, generality, simplicity, and flexibility. They are also particularly
well-suited for the large input dimension required for shape recognition tasks
such as character recognition.

Neural Networks and Machine Learning have become indispensable ingredients in
the design of OCR/HWR systems. Those systems are generally built as a cascade of
independent modules including: line and word locators, character segmenters,
feature extractors, character recognizers, and language models. However, in most
cases, only the character recognizer is trainable. We describe a new learning
paradigm called Graph Transformer Networks that allows all the modules in such a
system to be trained simultaneously so as to maximize a global performance
measure. Each module, called a Graph Transformer, takes graphs as input and
produces graphs as output. The arcs on the graphs carry numerical information
(scalars or vectors) such as images, scores, and class labels. A gradient-based
learning procedure can be used to train the parameters of the modules so as to
maximize a global objective function.

Graph Transformer modules offer much increased flexibility over traditional
gradient-based learning systems such as multilayer neural networks that
communicate their states and gradients via fixed-size vectors. A complete system
based on this concept for reading handwritten and printed bank checks is
described. It contains hundreds of thousands of trainable parameters and combines
convolutional neural network character recognizers with graph-based stochastic
models trained cooperatively at the document level. It is deployed commercially
and reads millions of business and personal checks per month with record accuracy.

Abstract

We have previously introduced the Gamma MLP which is defined as an
MLP with the usual synaptic weights replaced by gamma filters and associated
gain terms throughout all layers. In this paper we apply the Gamma MLP to a
larger scale speech phoneme recognition problem, analyze the operation of the
network, and investigate why the Gamma MLP can perform better than
alternatives. The Gamma MLP is capable of employing multiple temporal resolutions
(the temporal resolution is defined here, as per de Vries and Principe, as the
number of parameters of freedom (i.e. the number of tap variables) per unit of
time in the gamma memory - this is equal to the gamma memory μ parameter as
detailed in the paper). Multiple temporal resolutions may be advantageous for
certain problems, e.g. different resolutions may be optimal for extracting
different features from the input data. For the problem in this paper, the Gamma
MLP is observed to use a large range of temporal resolutions. In comparison,
TDNN networks typically use only a single temporal resolution. Further
motivation for the Gamma MLP is related to the “curse of dimensionality” and the
ability of the Gamma MLP to trade off temporal resolution for memory depth,
and therefore increase memory depth without increasing the dimensionality of
the network. The IIR MLP is a more general version of the Gamma MLP -
however the IIR MLP performs poorly for the problem in this paper.
Investigation suggests that the error surface of the Gamma MLP is more suitable
for gradient descent training than the error surface of the IIR MLP.

1 Introduction


Machine learning models used for speech recognition are required to account for a
high degree of variability in the data (e.g. acoustic variability, within-speaker
variability, across-speaker variability, and phonetic variability). For phoneme
recognition, methods of addressing these variabilities include using larger
datasets and using models which take into account greater context of the acoustic
signal. However, taking into account greater context typically leads to larger
models. The amount of training data required for accurate estimation of class
distributions can increase significantly when the input dimensionality increases
(cf. the “curse of dimensionality” [6])^1. As the complexity of the desired target
function for a given problem increases while the amount of data remains constant,
it becomes increasingly problematic to estimate the target function from finite
data due to the ill-posed nature of the problem - many of the models which fit the
training data closely do not generalize well to unseen data. In order to reduce
the difficulty with trying to approximate a function which is too complex for the
available data, we often consider looking for a hierarchical solution where
initial layers extract features which identify higher level attributes of the
data which enhance generalization. These features can be extracted manually, or
automatically. The Gamma MLP considers a transformation for the inputs to each
node and aims to optimize the transformation for each node individually in order
to improve performance. The process can be thought of as automatic feature
extraction (if the optimal transformations were known beforehand then those
transformations could be used to extract new features from the data).


2  The  Gamma  Filter 


Infinite Impulse Response (IIR) filters have a significant advantage over Finite
Impulse Response (FIR) filters in signal processing: the length of the impulse
response is uncoupled from the number of filter parameters. The length of the
impulse response is related to the memory depth^2 of a system, and hence IIR
filters allow a greater memory depth than FIR filters of the same order. However,
IIR filters are not widely used in adaptive signal processing [9]. This may be
attributed to the fact that a) there may be instability during training and b) the
gradient descent training procedures are not guaranteed to locate the global
optimum in the possibly non-convex error surface [11].


^1 Additionally, increases in the “complexity” of the desired target function may make gradient descent
optimization more difficult - training algorithms may take longer to converge or become “stuck” in local
minima or “plateaus” which are increasingly poor compared to the global optimum.

^2 A greater memory depth implies that the model can retain past information for a longer time.




The use of gamma filters as a memory structure at the input of an otherwise
standard MLP network was proposed by de Vries and Principe [5]. The gamma filter,
a special case of an IIR filter, is designed to retain the uncoupling of memory
depth from the number of parameters provided by IIR filters, but to have simple
stability conditions. The output of a neuron in a multilayer perceptron is
computed using^3

$$y_k^l = f\left(\sum_{i=0}^{N_{l-1}} w_{ki}^l \, y_i^{l-1}\right).$$

The addition of short term memory with delays was considered by de Vries and
Principe [5]:

$$y_k^l(t) = f\left(\sum_{i=0}^{N_{l-1}} \sum_{j=0}^{K} g_{kij}^l(t-j) \, y_i^{l-1}(t-j)\right),$$

where $g_{kij}^l(t) = \frac{(\mu_{ki}^l)^j}{(j-1)!} \, t^{j-1} e^{-\mu_{ki}^l t}$,
$j = 1, 2, \ldots, K$. The depth of the memory is controlled by μ, and K is the
order of the filter. For the discrete time case, de Vries and Principe [5] obtain
the following recurrence relation:

$$z_0(t) = x(t), \qquad z_j(t) = (1 - \mu)\, z_j(t-1) + \mu\, z_{j-1}(t-1), \quad j = 1, \ldots, K \qquad (1)$$

where $x(t)$ is the filter input and $z_j(t)$ are the filter outputs. For μ < 1
the gamma filter may be considered as a low pass filter. For μ = 1, the memory is
a tapped delay line corresponding to the memory structure in an FIR MLP (an MLP
where the weights are replaced by FIR filters and optional gain terms [2]) or a
TDNN. For μ < 1 the gamma memory structure implements a tapped dispersive delay
line where the degree of dispersion is controlled by μ.

de Vries and Principe [9] define the temporal resolution, R, of a gamma memory
structure as the number of parameters of freedom (i.e. the number of tap variables)
per unit of time in the filter memory: $R = K/D = \mu$, where D is the memory
depth of the structure (the temporal mean value of the impulse response of the last
tap) [10]: $D = K/\mu$. When μ = 1 the memory depth is equal to the order of the
memory, K. The memory depth increases when μ < 1, and the temporal resolution
decreases, i.e. the gamma memory can trade resolution for memory depth.
Therefore the gamma memory can be used to create models which can take into account
greater context with fewer parameters (without resorting to the use of a single low
temporal resolution) in comparison to TDNN or FIR MLP models.
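The trade between memory depth and resolution can be checked numerically. The following is a minimal sketch, assuming the discrete recurrence z_0(t) = x(t), z_j(t) = (1 - μ) z_j(t-1) + μ z_{j-1}(t-1) with a zero initial state; the function name and the chosen values of K and μ are illustrative. It drives the memory with a unit impulse and measures the temporal mean of the last tap's impulse response.

```python
import numpy as np

def gamma_memory(x, K, mu):
    """Return taps z[t, j], j = 0..K, of a gamma memory driven by x."""
    x = np.asarray(x, dtype=float)
    z = np.zeros((len(x), K + 1))
    prev = np.zeros(K + 1)              # tap values at time t - 1
    for t, xt in enumerate(x):
        cur = np.empty(K + 1)
        cur[0] = xt                     # z_0(t) = x(t)
        # z_j(t) = (1 - mu) z_j(t-1) + mu z_{j-1}(t-1) for j = 1..K
        cur[1:] = (1.0 - mu) * prev[1:] + mu * prev[:-1]
        z[t] = cur
        prev = cur
    return z

K, mu = 4, 0.5
impulse = np.zeros(400)
impulse[0] = 1.0
h_last = gamma_memory(impulse, K, mu)[:, K]    # impulse response of the last tap
t = np.arange(len(h_last))
depth = np.sum(t * h_last) / np.sum(h_last)    # temporal mean of that response
resolution = K / depth
```

With K = 4 and μ = 0.5 the measured depth is K/μ = 8 time steps and the resolution equals μ; setting μ = 1 turns every tap into a pure delay, which recovers the tapped delay line of the TDNN and FIR MLP.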


^3 where $y_k^l$ is the output of neuron k in layer l, $N_l$ is the number of neurons in layer l, $w_{ki}^l$ is the
weight connecting neuron k in layer l to neuron i in layer l - 1, $y_0^l = 1$ (bias), and f is commonly a
sigmoid function.

3 The Gamma MLP


3.1  Motivation 

The focused gamma network which uses the gamma memory as a preprocessing
layer for a standard MLP has been proposed by de Vries and Principe [5]. This
network allows for the use of only one temporal resolution per input. However, it
may be desirable to use multiple temporal resolutions (e.g. different resolutions
may be optimal for extracting different features or for classifying different
phonemes). The Gamma MLP is similar to a standard MLP except every synapse
contains a gamma memory structure and a gain factor. The temporal resolution of
the memory in each synapse is adjusted separately. Therefore, in contrast with the
focused gamma network, the Gamma MLP is able to use multiple temporal resolutions.
Additionally, the Gamma MLP can contain gamma memory structures in all layers of
the network.

Other motivation for the Gamma MLP can be seen with comparison to TDNN, FIR
MLP and IIR MLP (an MLP where the weights are replaced by IIR filters and
optional gain terms [1]) models. In comparison to the TDNN and FIR MLP models,
the Gamma MLP may provide improved performance because it allows temporal
resolution to be traded for memory depth, i.e. for a system of given dimensionality,
the Gamma MLP can employ filters with a greater memory depth. Additionally, in
comparison with the IIR MLP, the Gamma MLP may be significantly easier to train,
which is discussed further in section 5.


3.2  Definition 


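Definition 1 A Gamma MLP with L layers excluding the input layer (0, 1, ..., L), gamma
filters of order K, and $N_0, N_1, \ldots, N_L$ neurons per layer, is defined as:

$$y_k^l(t) = f\left(x_k^l(t)\right) \qquad (2)$$

$$x_k^l(t) = \sum_{i=0}^{N_{l-1}} c_{ki}^l(t) \sum_{j=0}^{K} w_{kij}^l(t)\, z_{kij}^l(t) \qquad (3)$$

$$z_{kij}^l(t) = \begin{cases} \left(1 - \mu_{ki}^l(t)\right) z_{kij}^l(t-1) + \mu_{ki}^l(t)\, z_{ki(j-1)}^l(t-1), & 1 \le j \le K \\ y_i^{l-1}(t), & j = 0 \end{cases} \qquad (4)$$

where $y_k^l(t)$ is the output of neuron k in layer l at time t, $c_{ki}^l$ = synaptic gain,
$f(\alpha) = \tanh(\alpha) = (e^{\alpha} - e^{-\alpha})/(e^{\alpha} + e^{-\alpha})$,
$k = 1, 2, \ldots, N_l$ (neuron index), $l = 0, 1, \ldots, L$ (layer), and
$z_{kij}^l|_{i=0} = 1$, $w_{kij}^l|_{i=0, j \ne 0} = 0$, $c_{ki}^l|_{i=0} = 1$ (bias).
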
□

A Gamma MLP is defined as a multilayer perceptron where every synapse contains
a gamma filter and a gain term (introduced in [7]), as shown in the definition above.
The Gamma MLP is therefore a special case of the IIR MLP [1]. The motivation
behind the inclusion of the gain term is discussed in section 5. A separate μ
parameter is used for each filter. Gradient descent update equations for the Gamma
MLP are given in [7]. In practice, it is often desirable to restrict the Gamma MLP
structure by using gamma filters only in the first layer and/or not using the
synaptic gain terms ($c_{ki}^l$), as is also the case for FIR and IIR MLP networks.
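The forward computation of one layer can be sketched as follows: each synapse keeps its own gamma memory with its own μ, the taps are combined by tap weights, scaled by a synaptic gain, summed over inputs, and passed through tanh. The class name, the initialization ranges, and the use of a separate bias vector in place of the i = 0 convention are assumptions made for this illustration; training (the update equations of [7]) is not shown.

```python
import numpy as np

class GammaLayer:
    """One layer of gamma-filter synapses followed by tanh units (a sketch)."""

    def __init__(self, n_in, n_out, K, rng):
        self.w = rng.normal(scale=0.1, size=(n_out, n_in, K + 1))  # tap weights
        self.c = np.ones((n_out, n_in))                            # synaptic gains
        self.mu = rng.uniform(0.2, 1.0, size=(n_out, n_in))        # one mu per synapse
        self.b = np.zeros(n_out)                                   # bias term
        self.z = np.zeros((n_out, n_in, K + 1))                    # taps at time t - 1

    def step(self, y_prev):
        """Advance one time step given the previous layer's outputs."""
        z_new = np.empty_like(self.z)
        z_new[:, :, 0] = y_prev                     # j = 0 tap is the synapse input
        mu = self.mu[:, :, None]
        # Gamma recurrence for taps j = 1..K, using tap values from time t - 1.
        z_new[:, :, 1:] = (1.0 - mu) * self.z[:, :, 1:] + mu * self.z[:, :, :-1]
        self.z = z_new
        # Weighted taps, scaled by the gain, summed over inputs, plus bias.
        x = np.sum(self.c * np.sum(self.w * z_new, axis=2), axis=1) + self.b
        return np.tanh(x)

rng = np.random.default_rng(0)
layer = GammaLayer(n_in=3, n_out=2, K=4, rng=rng)
outputs = np.array([layer.step(rng.normal(size=3)) for _ in range(10)])
```

Restricting the structure as described above corresponds to fixing `c` at one (no synaptic gains) or using such a layer only at the input and ordinary weights elsewhere.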


4  Phoneme  Recognition 

4.1  Task  Details 

Our  data  consists  of  the  “sa”  sentences  spoken  by  male  members  of  demographic 
region  3  in  the  TIMIT  database.  There  are  79  speakers.  The  problem  is  therefore 
speaker  independent  phoneme  prediction.  The  speakers  in  the  training  and  test  sets 
do  not  overlap. 

The  raw  speech  data  was  preprocessed  into  a  sequence  of  frames  using  PLP.  The 
analysis  window  (frame)  was  20  ms.  Each  succeeding  frame  overlapped  with  the 
preceding  frame  by  10  ms.  9  PLP  coefficients  plus  the  signal  power  were  extracted 
and  used  as  features  describing  each  frame  of  data.  The  difference  between  the 
current  and  previous  frames  was  added  to  the  input  vectors,  as  is  commonly  done 
[4].  Periods  of  silence  before  and  after  the  sentences  were  reduced  to  two  frames 
in  order  to  limit  any  skew  of  the  results  caused  by  a  disproportionate  percentage  of 
silence  frames. 
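The framing and the difference features can be sketched independently of the PLP analysis itself, which is not reproduced here. The sketch assumes a 16 kHz sampling rate and that the first frame receives a zero difference; both are assumptions, and the function names are illustrative.

```python
import numpy as np

def frame_signal(signal, rate_hz, window_ms=20.0, hop_ms=10.0):
    """Cut a signal into 20 ms frames that overlap their neighbours by 10 ms."""
    win = int(round(rate_hz * window_ms / 1000.0))
    hop = int(round(rate_hz * hop_ms / 1000.0))
    n_frames = 1 + (len(signal) - win) // hop
    return np.stack([signal[i * hop : i * hop + win] for i in range(n_frames)])

def add_deltas(features):
    """Append the difference between each frame's features and the previous frame's."""
    deltas = np.diff(features, axis=0, prepend=features[:1])  # first delta is zero
    return np.hstack([features, deltas])

one_second = np.arange(16000, dtype=float)       # stand-in for one second of speech
frames = frame_signal(one_second, rate_hz=16000)
per_frame = np.random.default_rng(0).normal(size=(len(frames), 10))  # stand-in features
inputs = add_deltas(per_frame)
```

With ten features per frame (nine PLP coefficients plus power), appending the differences gives a 20-dimensional input vector per frame.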

The models had 40 outputs corresponding to the 40 phonemes^4. The FIR and gamma
filter orders were 4 (5 taps), and the TDNN model had an input window of 5 steps
in time. The training set contained 10,000 frames, the test and validation sets
each contained 5,000 frames, and the networks had 40 hidden nodes. The networks
were trained for 200,000 updates. We used standard backpropagation with stochastic
update. The tanh activation function was used. A “search then converge” learning
rate schedule was used with an initial learning rate of 0.1 for the μ parameters and
0.2 for all other parameters.


^4 The TIMIT allophones were converted to the standard 40 phoneme set [8].

4.2 Results


Results are presented for frame level phoneme recognition, i.e. for each frame the
recognizer predicts the current phoneme. A few observations regarding the expected
results: guessing would result in 97.5% error, coarticulation^5 makes the task
difficult, and the possibility of zero error would not be expected due to
inevitable difficulty and errors in the phoneme labelling.

The frequencies of the forty classes vary significantly, and it was found that all
models had a tendency to “ignore” the rarer phonemes [3] due to biases inherent
in the neural network architecture and training algorithm. We therefore employed
a scaling technique whereby weight updates are scaled on a class by class basis.
The amount of scaling is varied using a control parameter, $c_s$, from none
($c_s = 0$) to scaling according to the prior probabilities of the classes
($c_s = 1$). Yaeger et al. have recently introduced a very similar technique
which they call “frequency balancing” [12].

Reporting results in terms of the percentage of correct classifications can be
misleading when the frequency of the individual classes varies significantly (e.g.
a relatively low error rate may be achieved by a network which ignores low
frequency classes). For this reason, results are reported here in terms of the MSSE
which is defined as: $\mathrm{MSSE} = \frac{1}{N_c} \sum_{i=1}^{N_c} (1 - S_i)^2$,
where $N_c$ = the number of classes and $S_i$ = the sensitivity of class i. The
sensitivity of a class is defined as the proportion of events labelled as that
class which are correctly detected. This criterion was chosen because each class
is given equal importance and the square causes lower individual sensitivities to
be penalized more (e.g. for a two class problem, class sensitivities of
100% and 0% produce a higher MSSE than sensitivities of 50% and 50%).
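The two-class example can be reproduced directly. The sketch below assumes that the MSSE is the mean over classes of (1 - S_i)^2, with S_i the fraction of frames labelled as class i that are predicted as class i; classes that never occur in the labels are skipped, which is a choice made here and not stated in the text. The function name is illustrative.

```python
import numpy as np

def msse(true_labels, predicted_labels, n_classes):
    """Mean over classes of the squared sensitivity shortfall (1 - S_i)**2."""
    true_labels = np.asarray(true_labels)
    predicted_labels = np.asarray(predicted_labels)
    squared_errors = []
    for c in range(n_classes):
        mask = true_labels == c
        if not mask.any():
            continue  # class absent from the labels: sensitivity undefined
        sensitivity = np.mean(predicted_labels[mask] == c)
        squared_errors.append((1.0 - sensitivity) ** 2)
    return float(np.mean(squared_errors))

labels = [0, 0, 1, 1]
ignores_class_1 = msse(labels, [0, 0, 0, 0], n_classes=2)   # sensitivities 100%, 0%
balanced = msse(labels, [0, 1, 1, 0], n_classes=2)          # sensitivities 50%, 50%
```

Both predictors classify half of the frames correctly, yet the one that ignores a class scores 0.5 against 0.25 for the balanced one, which is the behavior the criterion was chosen for.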

Figure 1 shows the results for the Gamma MLP, FIR MLP, and TDNN networks.
The degree of scaling, $c_s$, was varied from 0 to 1. Five trials were performed in
each case. The FIR MLP and Gamma MLP networks contained filters in both layers.
The Gamma MLP contained synaptic gains, however the FIR MLP was found to
perform significantly better without the synaptic gains for this problem. Scaling
with $c_s = 0.75$ resulted in the best performance for each of the networks and,
therefore, scaling with $c_s = 0.75$ was used for the later results.

^5 Coarticulation refers to changes in the way a speech segment is articulated depending on previous
(backward coarticulation) and following segments (forward coarticulation).

Results for the IIR MLP are not shown because it was not possible to obtain
significant convergence. Theoretically, the IIR MLP model is the most powerful
model used here (in the sense that it can represent a greater variety of
computational structures than the other networks with the same number of hidden
nodes). In particular, the Gamma MLP is a special case of the IIR MLP. Although
the IIR MLP is prone to stability problems, the stability of the model can be, and
was, controlled in the simulations performed here (by reflecting poles that move
outside the unit circle back inside). The most obvious hypothesis for the
difficulty in training the model is related to the error surface and the nature of
gradient descent. It is expected that the error surface of the IIR MLP presents
greater difficulty to gradient descent optimization. This is discussed further in
the next section.
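One common way to reflect a pole back inside the unit circle is to replace a pole p with |p| > 1 by 1/conj(p), which keeps its angle and inverts its radius. The sketch below assumes this mapping and a denominator given as coefficients [1, a_1, ..., a_n]; the text does not specify how the reflection was carried out in the simulations, and the function name is illustrative.

```python
import numpy as np

def reflect_unstable_poles(a):
    """Return denominator coefficients with poles outside the unit circle mirrored inside."""
    poles = np.roots(a)
    outside = np.abs(poles) > 1.0
    poles[outside] = 1.0 / np.conj(poles[outside])   # same angle, inverted radius
    return np.real(np.poly(poles))                   # back to polynomial coefficients

unstable = [1.0, -2.5, 1.0]          # poles at 2.0 and 0.5
stable = reflect_unstable_poles(unstable)
```

For the example the pole at 2.0 moves to 0.5, giving the denominator [1, -1, 0.25]. Poles lying exactly on the unit circle are left unchanged by this sketch and would need separate handling.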


[Figure 1 plot; horizontal axis: Degree of Prior Scaling]

Figure 1. Test MSSE results as the degree of scaling is modified. The best error
corresponds to a scaling degree of 0.75 for each network type. At each point,
box-whisker plots are shown on the left and the mean plus and minus one standard
deviation is shown on the right. Five trials were performed in each case.


5  Discussion 


The Gamma MLP may perform better than the standard TDNN and the FIR MLP
for speech recognition because the gamma filtering operation allows processing the
input data using multiple temporal resolutions. The Gamma MLP can therefore
account for more past history of the signal for a system of a given order (without
resorting to the use of a single low temporal resolution). Figure 2 shows the
distribution of the gamma μ parameters in a typical trained Gamma MLP. It can be
seen that a range of μ parameters, and therefore a range of temporal resolutions,
is employed by the network.


Figure 2. The final distribution of the gamma μ parameters for a sample Gamma
MLP.

The Gamma MLP often performs better when using the synaptic gain terms. This
improvement may be considered non-intuitive to many - the synaptic gains add
degrees of freedom, but no additional representational power. However, the error
surface will be different in each case, and results indicate that the surface for the
synaptic gains case can often be more amenable to gradient descent.

For the problem considered here, the Gamma MLP performs significantly better
than the IIR MLP although the Gamma MLP is a special case of the IIR MLP. It
is reasonable to believe that the IIR MLP could perform as well as, or possibly
better than, the Gamma MLP, but in practice it is difficult to make it do so for
the problem considered here. Figure 3 shows sample plots of the error surface for
Gamma and IIR MLP networks. In order to reduce computational expense and
use networks with fewer parameters to aid visualization, a simpler task has been
chosen. The task is Mackey-Glass prediction using networks that contain only
five hidden nodes (the order of the filters was 4, the initial learning rate was 0.1,
the training, test, and validation sets contained 500 points, and 100,000 stochastic
updates were performed in each case). Even with such small networks, the error
surface has many dimensions, making visualization difficult. Each plot in the
figures is with respect to two randomly chosen dimensions. In each case, the center
of the plot corresponds to the values of the parameters after training and the
range of each parameter on the plot is 8. The NMSE was evaluated at 225 points
equally spaced in a grid. For the IIR MLP, a greater percentage of “flat spots” and
complex surfaces can be observed. On average, the error surface for the IIR MLP
appears to be less suitable for gradient descent optimization, reinforcing the
conclusion that the poorer performance of the IIR MLP is due to optimization being
more difficult. Hence, in using the Gamma MLP instead of the IIR MLP, we are
trading off computational capacity for easier training. The test NMSE results for
20 simulations each using these networks show that the best performing IIR MLP was
only slightly worse than the best performing Gamma MLP. However, the Gamma MLP was
significantly better on average (NMSE of 0.0341 versus 0.185 for the IIR MLP).
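The plotting procedure described above (two randomly chosen dimensions, a range of 8 per parameter centered on the trained values, 225 grid points) can be sketched as follows. A 15 by 15 grid is assumed because it gives 225 points; the quadratic `toy_loss` and the parameter vector are hypothetical stand-ins for the network NMSE and the trained weights, and the function names are illustrative.

```python
import numpy as np

def error_surface_slice(loss, params, dim_a, dim_b, span=8.0, points=15):
    """Evaluate loss on a points x points grid centred on params along two dimensions."""
    offsets = np.linspace(-span / 2.0, span / 2.0, points)
    surface = np.empty((points, points))
    for r, da in enumerate(offsets):
        for c, db in enumerate(offsets):
            p = params.copy()
            p[dim_a] += da          # move along the first chosen dimension
            p[dim_b] += db          # move along the second chosen dimension
            surface[r, c] = loss(p)
    return surface

def toy_loss(p):
    # Hypothetical stand-in for the NMSE of a trained network.
    return float(np.sum(p**2))

rng = np.random.default_rng(0)
trained = np.zeros(30)                                   # stand-in for trained parameters
dim_a, dim_b = rng.choice(len(trained), size=2, replace=False)
surface = error_surface_slice(toy_loss, trained, dim_a, dim_b)
```

Each such slice shows only two of the many dimensions around one point in weight space, so it supports qualitative comparison of surfaces and not quantitative conclusions.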


Figure 3. Error surface plots for a sample Gamma MLP (left two columns) and a
sample IIR MLP (right two columns). Each plot is with respect to two randomly
chosen dimensions. In each case, the center of the plot corresponds to the values
of the parameters after training. The z-axis scale varies from plot to plot in order
to show the qualitative aspects of the surface (the plots only cover variation in two
dimensions and are only plotted around one point in weight space, therefore
quantitative conclusions should be drawn from the final NMSE results). From many of
these plots we have observed that there is a greater percentage of “flat spots” and
complex surfaces for the IIR MLP.


6  Conclusions 


We have applied the Gamma MLP to a speech phoneme recognition problem,
analyzed the operation of the network, and investigated why the Gamma MLP can
perform better than alternatives. The Gamma MLP is capable of employing
multiple temporal resolutions, which may be advantageous for certain problems, e.g.
different resolutions may be optimal for extracting different features from the input
data. For the problem in this paper, the Gamma MLP is observed to use a large range
of temporal resolutions. In comparison, TDNN networks typically use only a single
temporal resolution. The Gamma MLP is able to trade off temporal resolution for
memory depth, and therefore increase memory depth without increasing the
dimensionality of the network (or using a single low temporal resolution). The IIR
MLP is a more general version of the Gamma MLP - however the IIR MLP performed
poorly for the problem in this paper. Investigation suggested that the error
surface of the Gamma MLP is more suitable for gradient descent training than the
error surface of the IIR MLP.

Abstract

We present the problem of designing a classifier system
based on hidden Markov models (HMMs) from a labeled
training set with the objective of minimizing the rate of
mis-classification. The traditional design approach divides the
training set into subsets of identically labeled training vectors
and independently designs the HMM corresponding to each
subset of the training data using a maximum likelihood
criterion. However, this approach does not achieve the minimum
mis-classification objective. To design the globally optimal
recognizer, all the HMMs must be jointly optimized to
minimize the number of mis-classified training patterns. This is a
difficult design problem which we attack using the technique
of deterministic annealing (DA). In the DA approach, we
introduce randomness in the classification rule and minimize the
expected mis-classification rate of the random classifier while
controlling the level of randomness in its decision via a
constraint on the Shannon entropy. The effective cost function
is smooth and converges to the mis-classification cost at the
limit of zero entropy (non-random classification rule). The
DA approach can be implemented via an efficient
forward-backward algorithm for recomputing the model parameters.
This algorithm significantly outperforms the standard
maximum likelihood algorithm for a moderate increase in design
complexity.


1  Introduction 

The  hidden  Markov  model  (HMM)  is  commonly  used  as  a  stochastic  model  for 
time  sequences.  HMMs  were  originally  applied  within  main-stream  statistics, 
but  the  discovery  of  their  applicability  to  modeling  speech  utterances  [3,  6] 
has  led  to  extensive  research  activity  in  HMMs  over  the  last  three  decades.  An 
overwhelming  number  of  conventional  speech  recognition  systems  are  based 
on  the  use  of  the  HMM  to  model  various  speech  utterances  within  the  context 
of  traditional  discriminant-based  pattern  classification. 

In  this  paper,  we  address  the  problem  of  recognition  of  time  sequences 
modeled  by  HMMs.  It  is  formally  defined  as  the  design  of  a  recognizer  based 
on  a  labeled  training  set  (i.e.,  supervised  learning).  This  problem  has  been 
extensively  treated  in  the  speech  recognition  literature.  The  most  commonly 
used  approach  is  to  divide  the  training  set  into  subsets  of  identically  labeled 
training  vectors  and  independently  design  HMMs  for  each  subset  of  training 
data  via  maximum  likelihood  estimation  of  model  parameters.  After  design, 
the  system  is  used  for  recognizing  new  sequences  through  competition  between 
the  designed  HMMs.  The  input  sequence  is  declared  to  belong  to  the  winner 
(the  most  likely  model). 
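Recognition by competition can be sketched with the scaled forward algorithm: each class model scores the input sequence by its log-likelihood and the highest score wins. The sketch assumes discrete-observation HMMs described by an initial distribution `pi`, a transition matrix `A`, and an emission matrix `B`; the two toy models and all names are illustrative and not taken from the paper.

```python
import numpy as np

def log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM."""
    alpha = pi * B[:, obs[0]]                  # forward variable at the first step
    log_p = 0.0
    for o in obs[1:]:
        scale = alpha.sum()
        log_p += np.log(scale)                 # accumulate the scaling factors
        alpha = (alpha / scale) @ A * B[:, o]  # propagate one step, then emit o
    return log_p + np.log(alpha.sum())

def classify(obs, models):
    """Return the index of the model with the highest log-likelihood."""
    scores = [log_likelihood(obs, *m) for m in models]
    return int(np.argmax(scores))

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
model_0 = (pi, A, np.array([[0.9, 0.1], [0.8, 0.2]]))   # emits mostly symbol 0
model_1 = (pi, A, np.array([[0.1, 0.9], [0.2, 0.8]]))   # emits mostly symbol 1
winner_a = classify([0, 0, 1, 0, 0], [model_0, model_1])
winner_b = classify([1, 1, 1, 0, 1], [model_0, model_1])
```

Note that each model here would be designed only from its own class data; the decision, however, depends on all models at once, which is the reason for the joint optimization argued for below.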

The starting point of our work is the realization that the above
recognition problem is fundamentally a pattern classification problem. Further,
the quality of the recognizer is most appropriately measured by its rate of
classification error. This leads to two major observations: First, the
globally optimal recognizer must be designed through joint optimization of all
models. It is important to emphasize that the ultimate objective is not to
model the sequences belonging to each class as accurately as possible, but
rather, to distinguish between the classes while making as few errors as
possible. As classification is performed by competition between models, it is clear
that we must optimize all the model parameters simultaneously to minimize
classification errors.

This  also  connects  to  the  second  observation,  namely,  that  maximum  like¬ 
lihood  is  a  mismatched  cost  for  optimizing  the  classifier.  The  direct  measure 
of success is simply the empirical rate of correct classification. It should be
noted in passing that the Bayesian classifier, which is optimal in the sense
of minimum classification error, is a close relative of the maximum likelihood
approach above. However, its success depends on the availability of the
precise probability distributions, including the assumption that the model
structure is in complete agreement with the source. If one has access only
to a reasonably short training set, the performance of maximum likelihood
may differ significantly from that of minimum classification error, as will be
demonstrated  in  this  work.  We  note  that  the  shortcomings  of  the  maximum 
likelihood  method  have  been  previously  recognized  (e.g.  [1,  2,  4,  7])  and  joint 
optimization  approaches  have  been  suggested. 

There  are  several  important  difficulties  in  approaching  the  design  problem 
directly,  that  is,  by  joint  optimization  of  all  model  parameters  so  as  to  min¬ 
imize  the  rate  of  classification  error.  One  difficulty  is  that  unlike  maximum 
likelihood,  this  cost  function  is  piecewise  constant  and  all  gradients  with  re¬ 
spect  to  parameters  vanish  almost  everywhere  (an  infinitesimal  change  in  pa¬ 
rameter  values  will  not  change  the  classification  of  any  sequence  in  the  training 
set).  Thus,  one  cannot  simply  use  a  gradient  based  optimization  method.  An 
important approach to address this problem appeared in [7], where the cost
surface was smoothed to allow the application of gradient methods. (A paper
that appeared a few weeks ago [5] extended this method to HMM
classification.) Another important difficulty is that even if the cost surface
is  smoothed,  the  optimization  process  tends  to  suffer  from  numerous  shallow 
local  minima  that  riddle  this  complex  cost  surface.  Finally,  one  must  keep 
in  mind  the  difficulties  associated  with  the  computational  complexity  of  such 
joint  optimization. 

The  main  contribution  of  this  paper  is  a  novel  method  for  designing  HMM- 
based  recognizers.  The  new  method  is  based  on  the  deterministic  annealing 
approach  to  clustering  [14,  13]  and  in  particular  to  its  recent  extension  to 
classification  [8].  By  introducing  randomness  that  is  controlled  by  impos¬ 
ing  the  level  of  Shannon  entropy,  we  obtain  an  effective  cost  function  that 
is  smooth  and  converges  to  the  original  classification  error  cost  at  the  limit 
of  zero  entropy.  Further,  this  process  is  analogous  to  physical  annealing  and 
hence  has  the  capability  to  avoid  many  shallow  minima  that  trap  standard 
local  optimization  methods.  It  is  also  important  to  note  that  unlike  the 
stochastic  procedure  of  simulated  annealing,  the  process  here  is  determin¬ 
istic  and  all  randomization  is  taken  into  account  by  taking  the  expectation 
of  the  various  quantities.  Another  important  result  is  the  development  of 
a forward-backward algorithm (similar to Baum-Welch re-optimization) for
recomputing the parameters of all models in our joint optimization framework.
(Note that here we do not use maximum likelihood as our ultimate
objective.) This algorithm is instrumental in keeping the computational
complexity manageable. The approach is shown to substantially outperform the
standard maximum likelihood method at the cost of a moderate increase in
design complexity with respect to the separate design of one HMM per class.


2  The  HMM  classifier  and  its  design 

We address the supervised learning problem of designing a recognition system
from a labeled training set, $T = \{(y_1, c_1), (y_2, c_2), \ldots, (y_N, c_N)\}$.
Each training pattern, $y_i$, is a vector of $l_i$ observations,
$y_i = (y_i(1), y_i(2), \ldots, y_i(l_i))$.
Further, each observation, $y_i(t)$, is a discrete quantity, i.e.
$y_i(t) \in \mathcal{A} = \{1, 2, \ldots, K\}$. Despite this restriction to the case of
discrete observations, we note that the design methods can easily be extended
to handle continuous-valued observations as well. The training pattern, $y_i$,
belongs to class $c_i$, which may be one of $M$ classes, i.e.
$c_i \in \mathcal{C} = \{1, 2, \ldots, M\}$.

The HMM recognition system consists of a set of hidden Markov models,
$\{H_j, j = 1, 2, \ldots, M\}$, one per class index. The model $H_j$ has $S_j$ states and
is fully specified by the parameter set $\Lambda_j = (A_j, B_j, \Pi_j)$, where, following the
usual convention, $A_j$ is the $(S_j \times S_j)$ state transition probability matrix, $B_j$
is the $(S_j \times K)$ emission probability matrix and $\Pi_j$ is the (length $S_j$) initial
state probability vector.

The classifier works as follows: Given a training pattern, $y_i$, for each
HMM, $H_j$, and for each sequence (of length $l_i$) of states,
$s = (s(1), s(2), \ldots, s(l_i))$, in the trellis of $H_j$, we determine the log
likelihood, $l(y_i, s, H_j)$, that the observation $y_i$ is generated via the state
sequence, $s$. Hence,

$$l(y_i, s, H_j) = \log \Pi_j(s(1)) + \sum_{t=1}^{l_i - 1} \log A_j(s(t), s(t+1)) + \sum_{t=1}^{l_i} \log B_j(s(t), y_i(t)). \qquad (1)$$

Here, $A_j(m,n)$ is the $(m,n)$ element of the matrix $A_j$. Similarly, $B_j(m,k)$
is the $(m,k)$ element of the matrix $B_j$, and $\Pi_j(m)$ is the $m$th component of the
vector $\Pi_j$.

Next, we maximize the log likelihood over all state sequences in the trellis
of HMM $H_j$ and determine

$$d_j(y_i) = \max_{s \in S_{l_i}(H_j)} l(y_i, s, H_j). \qquad (2)$$

Here, $S_l(H_j)$ is the set of all state sequences of length $l$ in the trellis of
HMM $H_j$. The quantity $d_j(y_i)$ thus represents the log likelihood of the
state sequence in model $H_j$ that most likely generated $y_i$. Interpreting $d_j(\cdot)$
as the discriminant for class $j$, we adopt the traditional discriminant-based
classification approach to define the classifier operation as:

$$C(y_i) = \arg\max_j d_j(y_i). \qquad (3)$$

We refer to this definition as the “best path” discriminant.¹ This classification
system can be viewed as a competition between paths. The observation
is ultimately labeled by the class index of the HMM to which the winning
path belongs. One advantage of the “best path” discriminant classifier is
that the search for the most likely path (choosing a state sequence, s, that
maximizes (2)) can be reduced to a sequential optimization problem that can
be solved via an efficient dynamic programming algorithm (Viterbi search).

¹ Our design method can easily be modified to handle the case where the
discriminant is obtained by appropriate averaging of the likelihood over all
paths in the class model.
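
The best-path discriminant and the competition between models can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the two toy models, the helper names (`make_model`, `best_path_score`, `classify`) and the coding of symbols as 0..K-1 (the paper uses 1..K) are assumptions made for the example.

```python
import math

def log_matrix(rows):
    """Element-wise log of a probability matrix; log(0) becomes -inf."""
    return [[math.log(p) if p > 0 else -math.inf for p in row] for row in rows]

def make_model(pi, A, B):
    """Bundle (log Pi, log A, log B) for one hypothetical HMM."""
    log_pi = [math.log(p) if p > 0 else -math.inf for p in pi]
    return (log_pi, log_matrix(A), log_matrix(B))

def best_path_score(obs, log_pi, log_A, log_B):
    """Viterbi search: largest log likelihood over all state sequences (d_j)."""
    S = len(log_pi)
    # delta[m] = best log likelihood of any partial path ending in state m
    delta = [log_pi[m] + log_B[m][obs[0]] for m in range(S)]
    for y in obs[1:]:
        delta = [max(delta[m] + log_A[m][n] for m in range(S)) + log_B[n][y]
                 for n in range(S)]
    return max(delta)

def classify(obs, models):
    """Return the index of the model whose best path wins the competition."""
    scores = [best_path_score(obs, *model) for model in models]
    return max(range(len(models)), key=scores.__getitem__)

# Two hypothetical 2-state models over a binary alphabet {0, 1}.
model_0 = make_model([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.8, 0.2]])
model_1 = make_model([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.1, 0.9], [0.2, 0.8]])

print(classify([0, 0, 0, 1], [model_0, model_1]))
print(classify([1, 1, 0, 1], [model_0, model_1]))
```

The recursion visits each time step once and each pair of states once per step, which is what makes the best-path search cheap compared with enumerating every state sequence.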

2.1 HMM classifier design

The problem of HMM classifier design can be stated as the joint optimization
of the HMM parameters, $\{\Lambda_j\}$, to minimize the empirical probability of
misclassification measured over the training set,

$$\min_{\{\Lambda_j\}} P_e = 1 - \frac{1}{N} \sum_{i=1}^{N} \delta(C(y_i), c_i), \qquad (4)$$

where $\delta$ is the indicator function: $\delta(u,v) = 1$ if $u = v$ and 0 otherwise.

The  most  important  difficulty  in  this  optimization  is  that  the  cost,  Pe, 
is  a  piecewise  constant  function  of  the  optimization  variables.  As  a  result, 
we  cannot  use  traditional  gradient  descent  based  optimization  methods  -  the 
gradients  are  zero  almost  everywhere.  One  approach  [7]  to  circumvent  this 
difficulty  is  to  replace  the  piecewise  cost  function  by  a  smooth  approxima¬ 
tion  to  it.  While  the  modified  cost  function  is  amenable  to  descent-based 
optimization,  in  practice,  there  are  numerous  shallow  local  minima  on  the 
complex  cost  surface  that  can  easily  trap  optimization  methods  based  on 
simple  descent.  In  the  next  section,  we  present  a  novel  approach  based  on 
deterministic  annealing  to  simultaneously  tackle  the  piecewise  nature  of  the 
cost  function  and  the  problem  of  shallow  local  minima  traps. 


3  Deterministic  Annealing  approach 

We  take  as  our  starting  point,  the  deterministic  annealing  approach  to  clus¬ 
tering,  vector  quantization  [14]  and  related  optimization  problems  [13]  and  its 
extension  to  structurally-constrained  clustering  problems  [8].  The  extended 
method  can  handle  problems  involving  structural  constraints  on  the  cluster¬ 
ing  rule  e.g.  tree  structured  vector  quantization,  pattern  classifiers  based 
on  parametric  discriminant  functions  etc.  We  have  recently  applied  the  ex¬ 
tended  DA  method  successfully  to  the  design  of  standard  pattern  classifiers 
[8],  regression  functions  [9,  12,  11]  and  source  coding  systems  [10].  The  work 
presented  in  this  paper  represents  an  important  extension  of  the  method  to 
handle  time  sequences  that  are  modeled  by  HMMs. 

We  cast  the  optimization  problem  within  a  probabilistic  framework  and 
maintain  that,  during  design,  it  is  useful  to  consider  a  randomized  HMM 
classifier  system.  In  the  randomized  classifier,  given  an  observation,  a  win¬ 
ning  state  sequence  is  randomly  chosen  from  among  all  state  sequences  in 
all  the  HMMs.  This  (random)  choice  of  the  winning  state  sequence  is  based 
on  a  probability  distribution  -  we  replace  the  best-path  discriminant  rule 
which associates a pattern to a unique winning state sequence by a randomized
best-path discriminant rule that associates each pattern, $y_i$, to
every state sequence, $s$, in the trellis of every model, $H_j$, with a probability,
$P(y_i, s, H_j)$. Naturally, these probabilities are normalized such that

$$\sum_j \sum_{s \in S_{l_i}(H_j)} P(y_i, s, H_j) = 1.$$

The  probabilities,  P{yi,  s,  Hj),  are  obtained  in  a  systematic  manner:  We 
first  note  that  the  non-random  best-path  discriminant  rule  may  be  expressed 
as  minimization  over  function, 

i 

(5) 

t 

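After randomness is introduced, this cost function is replaced by the expected
cost,

$$\langle D \rangle = -\frac{1}{N} \sum_i \sum_j \sum_{s \in S_{l_i}(H_j)} P(y_i, s, H_j)\, l(y_i, s, H_j), \qquad (6)$$

which is minimized, while simultaneously enforcing a level of randomness
through a constraint on the Shannon entropy,

$$H = -\frac{1}{N} \sum_i \sum_j \sum_{s \in S_{l_i}(H_j)} P(y_i, s, H_j) \log P(y_i, s, H_j). \qquad (7)$$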


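In particular, we minimize $\langle D \rangle$ subject to $H = \hat{H}$, where $\hat{H}$ is the
prescribed level of entropy. The probability distribution obtained via this
constrained optimization problem is the Gibbs distribution,

$$P(y_i, s, H_j) = \frac{e^{\gamma\, l(y_i, s, H_j)}}{\sum_{j'} \sum_{s' \in S_{l_i}(H_{j'})} e^{\gamma\, l(y_i, s', H_{j'})}}. \qquad (8)$$
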


The value of Shannon entropy, $H$, corresponding to this Gibbs distribution
is determined by the positive scale parameter, $\gamma$. This parameter also controls
the “randomness” of the distribution. For $\gamma = 0$, the distribution over paths
is uniform. For finite, positive values of $\gamma$, the Gibbs distribution indicates
that we assign higher probabilities of winning to state sequences with higher
log likelihoods. In the limiting case of $\gamma \to \infty$, the random classification rule
reverts to the non-random “best path” classifier, which assigns a non-zero
probability of winning only to the path with the highest log likelihood as in
(2).
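
The role of the scale parameter can be seen on a toy example. The sketch below assumes that the paths of all models for one pattern have already been enumerated and pooled into a single list of log likelihoods; the numbers and the function names `gibbs` and `entropy` are hypothetical and chosen only for illustration.

```python
import math

def gibbs(log_likelihoods, gamma):
    """Gibbs distribution with P proportional to exp(gamma * l)."""
    peak = max(log_likelihoods)  # subtracted for numerical stability only
    weights = [math.exp(gamma * (l - peak)) for l in log_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical log likelihoods of four competing paths for one pattern.
paths = [-4.0, -5.5, -7.0, -4.5]

for gamma in (0.0, 1.0, 100.0):
    p = gibbs(paths, gamma)
    print(gamma, [round(x, 3) for x in p], round(entropy(p), 3))
```

At gamma = 0 every path receives the same probability and the entropy is at its maximum; as gamma grows, the probability mass concentrates on the path with the highest log likelihood and the entropy falls toward zero.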

The random classifier’s expected rate of misclassification (over the training
set) can be calculated as

$$\langle P_e \rangle = 1 - \frac{1}{N} \sum_{i=1}^{N} \sum_{s \in S_{l_i}(H_{c_i})} P(y_i, s, H_{c_i}). \qquad (9)$$

Next, we pose the problem of optimizing this random HMM classifier
(choosing $\{\Lambda_j\}$ and $\gamma$) to minimize the expected misclassification probability
of (9). However, simply minimizing (9) over all Gibbs distributions chooses
one that is non-random ($\gamma \to \infty$). While such a non-random, best-path
classifier is the eventual goal of this design method, we wish to enforce the
“non-randomness” gradually during the optimization, to avoid shallow local
minima traps.

As such, we follow the philosophy underlying the deterministic annealing
approach and pose the problem of minimizing $\langle P_e \rangle$ while maintaining a
level of randomness in the classifier through a constraint on the entropy,
$H = \hat{H}$, for a prescribed level $\hat{H}$. This constrained optimization problem
is equivalent to the minimization of the unconstrained Lagrangian cost function,

$$\min_{\{\Lambda_j\}, \gamma} L = \langle P_e \rangle - T H, \qquad (10)$$

where $T$ is the Lagrange parameter that we refer to as the “temperature”
because of an interesting analogy in statistical physics.
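
To make the Lagrangian concrete, the following sketch evaluates the expected misclassification rate, the entropy and their combination for a toy training set. It is an illustration under stated assumptions rather than the authors' code: each pattern is represented by an explicit list of (model index, path log likelihood) pairs with invented values, and the entropy is averaged over the training patterns, which is an assumed normalization.

```python
import math

def lagrangian(patterns, gamma, T):
    """Return (L, expected_error, entropy) for a randomized classifier.

    patterns: list of (true_class, [(model_index, log_likelihood), ...]),
    one entry per training pattern, listing every path of every model.
    """
    n = len(patterns)
    expected_error = 0.0
    entropy = 0.0
    for true_class, paths in patterns:
        peak = max(l for _, l in paths)
        weights = [math.exp(gamma * (l - peak)) for _, l in paths]
        z = sum(weights)
        probs = [w / z for w in weights]
        # Probability that the randomly chosen winning path lies in the
        # model of the correct class.
        p_correct = sum(p for p, (j, _) in zip(probs, paths) if j == true_class)
        expected_error += (1.0 - p_correct) / n
        entropy += -sum(p * math.log(p) for p in probs if p > 0) / n
    return expected_error - T * entropy, expected_error, entropy

# Hypothetical data: two patterns, two models, two paths per model.
patterns = [
    (0, [(0, -3.0), (0, -4.0), (1, -3.5), (1, -6.0)]),
    (1, [(0, -2.5), (0, -5.0), (1, -2.0), (1, -4.0)]),
]

for gamma in (0.0, 1.0, 10.0):
    print(gamma, lagrangian(patterns, gamma, T=0.5))
```

With T large the entropy term dominates and a small gamma (near-uniform distribution) gives the lowest L; with T near zero only the expected error matters, which favors a large gamma.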

3.1  Analogy  to  statistical  physics 

The Lagrangian minimization of (10) is reminiscent of the definition of thermal
equilibrium in statistical physics. The quantity $L$ is analogous to
the Helmholtz free energy of a thermodynamic system with average energy
$\langle P_e \rangle$, entropy over energy states $H$, and temperature $T$. This free energy
is the quantity that is minimized when this thermodynamic system is at
thermal equilibrium at temperature $T$.

From the optimization viewpoint, we are particularly interested in thermal
equilibrium at $T = 0$, which corresponds to direct minimization of $\langle P_e \rangle$, our
ultimate objective. The analogy to physical systems suggests that to minimize
$\langle P_e \rangle$, it is useful to implement an annealing process, that is, gradually
lower the temperature while maintaining the system at thermal equilibrium.
We start with a very high value of $T$, where the sole objective is entropy
maximization, which is achieved by the uniform distribution. Reducing $T$
gradually from this high value, we repeat the process of minimizing $L$ until
$T = 0$, where the sole objective is optimizing $\{\Lambda_j\}$ and $\gamma$ to minimize $P_e$.

After this annealing process, we also include as a final step a “quenching”
mechanism: we optimize $\{\Lambda_j\}$ to minimize $P_e$, while increasing $\gamma$ from
its optimal value at $T = 0$, in gradual steps, to a very high value. When
$\gamma$ is sufficiently high, the classifier reduces to the non-random “best-path”
classifier.

The annealing process yields a sequence of solutions at decreasing levels
of entropy and $P_e$, leading to the “best-path” classifier in the limit. The DA
method is not a stochastic method like simulated annealing, but is instead based
on the optimization of the deterministically computed expectation, $L$, at each
temperature. This minimization is achieved by a series of gradient descent
steps with the following expressions for the gradients:


dL 

dAj 


wT,  E  Hyi,s,  Hj)P{yi,s,  Hj){  < 

i  S€S,,(Wj) 


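and

$$\frac{\partial L}{\partial \gamma} = \frac{1}{N} \sum_i \sum_j \sum_{s \in S_{l_i}(H_j)} L(y_i, s, H_j)\, P(y_i, s, H_j) \left\{ l(y_i, s, H_j) - \langle l(y_i, s, H_j) \rangle \right\}.$$

Here, $L(y_i, s, H_j) = T \gamma\, l(y_i, s, H_j) - \delta(j, c_i)$. The operation $\langle f(\cdot) \rangle_j$
represents an expectation of the (state-sequence dependent) function $f(\cdot)$ over
the state sequences in the trellis of HMM $H_j$. Hence,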


^  dl{yi,s,Hj)  ^ 
dAj 


(11) 


Similarly, $\langle f(\cdot) \rangle$ represents the expectation of the function $f(\cdot)$ over all
state sequences in the trellises of all the HMMs. Hence,

$$\langle l(y_i, s, H_j) \rangle = \sum_j \sum_{s \in S_{l_i}(H_j)} P(y_i, s, H_j)\, l(y_i, s, H_j). \qquad (12)$$

An important aspect of the proposed method is the discovery of an efficient
forward-backward algorithm to determine these gradients. Note
that the summations in the gradient expressions are over all state sequences
in the trellises of the HMMs. The number of such paths grows exponentially
with the length of the observation sequence (a model with $S_j$ states has up
to $S_j^{l_i}$ paths for a pattern of length $l_i$). However, these summations can be
efficiently computed via a forward-backward algorithm, which reduces the
number of computations substantially (to a number proportional to the square
of the number of states in the HMM), thus cutting down on computational
complexity and memory requirements. The complexity of the DA method scales
similarly to that of the maximum likelihood method with respect to the number
of states and training vectors.
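
The principle behind such a recursion can be illustrated on the simplest of these sums, the normalizer of the Gibbs distribution for one model, i.e. the sum of exp(gamma * l) over all paths. The sketch below is not the authors' forward-backward algorithm (which also has to deliver the gradient terms); it only shows, under that narrower assumption, that a forward pass reproduces the brute-force sum over all paths. The toy model and function names are hypothetical.

```python
import itertools
import math

def path_log_likelihood(obs, s, pi, A, B):
    """Log likelihood of observing obs along the state sequence s."""
    l = math.log(pi[s[0]]) + math.log(B[s[0]][obs[0]])
    for t in range(1, len(obs)):
        l += math.log(A[s[t - 1]][s[t]]) + math.log(B[s[t]][obs[t]])
    return l

def partition_brute_force(obs, pi, A, B, gamma):
    """Sum of exp(gamma * l) over every path: S ** len(obs) terms."""
    S = len(pi)
    return sum(math.exp(gamma * path_log_likelihood(obs, s, pi, A, B))
               for s in itertools.product(range(S), repeat=len(obs)))

def partition_forward(obs, pi, A, B, gamma):
    """Same sum by a forward recursion: about len(obs) * S * S operations."""
    S = len(pi)
    # alpha[n] = sum of exp(gamma * l) over all partial paths ending in n
    alpha = [(pi[m] * B[m][obs[0]]) ** gamma for m in range(S)]
    for y in obs[1:]:
        alpha = [sum(alpha[m] * A[m][n] ** gamma for m in range(S))
                 * B[n][y] ** gamma
                 for n in range(S)]
    return sum(alpha)

# Hypothetical 3-state model over a binary alphabet (all entries positive).
pi = [0.5, 0.3, 0.2]
A = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]]
B = [[0.7, 0.3], [0.4, 0.6], [0.1, 0.9]]
obs = [0, 1, 1, 0, 1]

print(partition_brute_force(obs, pi, A, B, gamma=2.0))
print(partition_forward(obs, pi, A, B, gamma=2.0))
```

For gamma = 1 the quantity reduces to the ordinary likelihood of the observation under the model, so the recursion coincides with the standard forward algorithm.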


4  Experimental  Results 

We have performed preliminary simulations to determine the usefulness of
our new design method. We experimented on designing simple (2-, 3- and
4-class) classifier systems for eight different data sets of 2000 vectors each.
Allowing three to six states in each Markov model, we designed HMM classifier
systems using the maximum likelihood and deterministic annealing methods.
We observe that the proposed DA approach improved the classification
performance consistently and considerably. Over the experiment’s data sets, the
rate of misclassification was reduced by factors of 1.2 to 3. Table 1 details
the results.

We  are  currently  investigating  the  effectiveness  of  the  design  method  on 
real-world  speech  data  to  demonstrate  its  advantages  for  the  speech  recogni¬ 
tion  problem. 


Dataset           1        2        3        4
No. of Classes    2        2        2
Pe (ML)           17.4%    31.6%    [illegible]
Pe (DA)           6.5%     21.7%    18.7%

Dataset           5        6        7        8
No. of Classes    3        3        3        4
Pe (ML)           27.0%    32.5%    24.9%    42.3%
Pe (DA)           21.0%    27.3%    17.4%    31.7%

Table 1: A comparison of the misclassification rates obtained for HMM classifiers
designed from eight classified training sets of 2000 patterns each. Each
set consists of data from 2, 3 or 4 classes. ML represents a maximum likelihood
design algorithm and DA represents the deterministic annealing algorithm.


5  Conclusion 

In this paper we propose a novel training method for HMM classifier systems
that jointly optimizes all the models to minimize the true cost, namely, the
rate of misclassification. At the cost of a moderate increase in complexity,
considerable improvements in recognition rates are obtained.


Abstract

We investigate the problem of training a Support Vector Machine
(SVM) [1, 2, 7] on a very large database (e.g. 50,000
data points) in the case in which the number of support vectors
is also very large (e.g. 40,000). Training an SVM is equivalent
to solving a linearly constrained quadratic programming
(QP) problem in a number of variables equal to the number
of data points. This optimization problem is known to
be challenging when the number of data points exceeds a few
thousand. In previous work, done by us as well as by other researchers,
the strategy used to solve the large-scale QP problem
takes advantage of the fact that the expected number of
support vectors is small (< 3,000). Therefore, the existing algorithms
cannot deal with more than a few thousand support
vectors. In this paper we present a decomposition algorithm
that is guaranteed to solve the QP problem and that does not
make assumptions on the expected number of support vectors.
In order to demonstrate the feasibility of our approach we
consider a foreign exchange rate time series database with
110,000 data points that generates 100,000 support vectors.


1  Introduction 

In this paper we consider the problem of training a Support Vector Machine
(SVM), a pattern classification algorithm recently developed by V. Vapnik
and his team at AT&T Bell Labs [1, 2, 7]. SVM can be seen as a new
way to train polynomial, neural network, or Radial Basis Function classifiers,
based on the idea of structural risk minimization rather than empirical
risk minimization.¹ From the implementation point of view, training an SVM

¹ The name of SVM is due to the fact that one of the outcomes of the algorithm, in
addition to the parameters for the classifier, is a set of data points (the “support vectors”)
which contain, in a sense, all the “relevant” information about the problem.
is equivalent to solving a linearly constrained Quadratic Programming (QP)
problem in a number of variables equal to the number of data points. This
problem is challenging when the size of the data set becomes larger than a
few thousand, which is often the case in practical applications. A number
of techniques for SVM training have been proposed [7, 4, 5, 6]. However,
many of these strategies take advantage of the following assumptions, or
expectations: 1) the number of support vectors is small with respect to the
number of data points; 2) the total number of support vectors does not
exceed a few thousand (e.g. < 3,000). Since the ratio between the number
of support vectors and the total number of data points (averaged over
the probability distribution of the input variables) is an upper bound on the
generalization error, the previous assumptions are violated in the following
cases: 1) the problem is “difficult”, so that the generalization error will be
large and therefore the proportion of support vectors is high, or 2) the data
set is so large (say 300,000) that even if the problem can have small
generalization error (say 1%) the number of support vectors will be large (in this
case around 3,000).

The algorithm that we present in this paper does not make the above-mentioned
assumptions. It should be noted, however, that in the case in which
the assumptions above are satisfied the algorithm does take advantage of
them. The algorithm is similar in spirit to the algorithm that we proposed in
[4] (which was limited to a few thousand support vectors): it is a
decomposition algorithm, in which the original QP problem is replaced by a
sequence of smaller problems that is proved to converge to the global optimum.
Although the experiments we report in this paper concern a classification
problem, the current algorithm can also be used, with minimal modifications,
to train the new version of the SVM, which can deal with regression as well as
classification.

The  plan  of  the  paper  is  as  follows:  in  the  next  section  we  briefly  sketch  the 
ideas  underlying  SVM.  Then  in  section  3  we  present  our  new  algorithm,  in 
section  4  we  show  some  results  of  our  implementation  on  a  financial  data 
set  with  110,000  data-points  with  as  many  as  100,000  support  vectors  and  in 
section  5  we  summarize  the  paper. 


2  Support  Vector  Machines 

In this section we briefly sketch the SVM algorithm and its motivation. A
more detailed description of SVM can be found in [7] (chapter 5) and [2].
We start from the simple case of two linearly separable classes. We assume
that we have a data set $D = \{(x_i, y_i)\}_{i=1}^{\ell}$ of labeled examples, where
$y_i \in \{-1, 1\}$, and we wish to select, among the infinite number of linear classifiers
that separate the data, one that minimizes the generalization error, or at
least an upper bound on it (this is the idea underlying the structural risk
minimization principle [7]). V. Vapnik showed [7] that the hyperplane with
this property is the one that leaves the maximum margin between the two
classes [1], where the margin is defined as the sum of the distances of the
hyperplane from the closest point of the two classes.

If the two classes are non-separable the SVM looks for the hyperplane that
maximizes the margin and that, at the same time, minimizes a quantity
proportional to the number of misclassification errors. The trade-off between
margin and misclassification error is controlled by a positive constant $C$ that
has to be chosen beforehand. In this case it can be shown that the solution
to this problem is a linear classifier
$f(x) = \mathrm{sign}\left(\sum_{i=1}^{\ell} \lambda_i y_i x^T x_i + b\right)$ whose
coefficients $\lambda_i$ are the solution of a QP problem, defined over the hypercube
$[0, C]^{\ell}$, whose precise statement will be given in section 3 (see eq. 1). Since the
quadratic form is minimized in the hypercube $[0, C]^{\ell}$, the solution will have a
number of coefficients $\lambda_i$ exactly equal to zero. Since there is a coefficient $\lambda_i$
associated to each data point, only the data points corresponding to non-zero
$\lambda_i$ (the “support vectors”) will influence the solution. Intuitively, the support
vectors are the data points that lie at the border between the two classes. It
is then clear that a small number of support vectors indicates that the two
classes can be well separated.

This technique can be extended to allow for non-linear decision surfaces. This
is done by projecting the original set of variables $x$ into a higher dimensional
feature space: $x \in R^d \Rightarrow z(x) = (\phi_1(x), \ldots, \phi_n(x)) \in R^n$ (where $n$ is possibly
infinite) and by formulating the linear classification problem in the feature
space. Vapnik proves that there are certain choices of features for which
the solution has the following form:

$$f(x) = \mathrm{sign}\left( \sum_{i=1}^{\ell} \lambda_i y_i K(x, x_i) + b \right)$$

where $K(x, y)$ is a symmetric positive definite kernel function that depends
on the choice of the features and represents the scalar product in the feature
space. In Table 1 we list some choices of the kernel function proposed by
Vapnik: notice how they lead to well known classifiers, whose decision surfaces
are known to have good approximation properties.
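
The kernel form of the decision function can be sketched as follows. The coefficients, labels, points and offset below are invented for illustration and are not the solution of any QP; the unit-width Gaussian kernel is likewise an assumed parametrization, and the function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rbf_kernel(x, z):
    """Gaussian RBF kernel (unit width assumed)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)))

def poly_kernel(x, z, d=2):
    """Polynomial kernel of degree d."""
    return (dot(x, z) + 1.0) ** d

def decision(x, support, b, kernel):
    """sign(sum_i lambda_i * y_i * K(x, x_i) + b) over the support vectors.

    support: list of (lambda_i, y_i, x_i) with lambda_i > 0.
    """
    total = sum(lam * y * kernel(x, xi) for lam, y, xi in support) + b
    return 1 if total >= 0 else -1

# Hypothetical support vectors: one per class.
support = [(1.0, +1, (1.0, 1.0)), (1.0, -1, (-1.0, -1.0))]

for kernel in (rbf_kernel, poly_kernel):
    print(kernel.__name__,
          decision((0.8, 0.9), support, 0.0, kernel),
          decision((-1.0, -0.5), support, 0.0, kernel))
```

Only the points with non-zero coefficients enter the sum, which is why the cost of evaluating the classifier grows with the number of support vectors rather than with the size of the training set.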


3  Training  a  Support  Vector  Machine 

Kernel Function                            Type of Classifier
$K(x, x_i) = \exp(-\|x - x_i\|^2)$         Gaussian RBF
$K(x, x_i) = (x^T x_i + 1)^d$              Polynomial of degree $d$
$K(x, x_i) = \tanh(x^T x_i - \Theta)$      Multi Layer Perceptron

Table 1: Some possible kernel functions and the type of decision surface they
define.

In this section we present a decomposition algorithm that, without making
assumptions on the expected number of support vectors, allows us to train an
SVM on a large data set by solving a sequence of smaller QP problems. The
two key issues to be considered are:

1.  Optimality  Conditions:  These  conditions  allow  us  to  decide  com¬ 
putationally  whether  the  problem  has  been  solved  optimally  at  a  par¬ 
ticular  iteration  of  the  original  problem.  Section  3.1  states  and  proves 
optimality  conditions  for  the  QP  given  by  (1). 

2.  Strategy  for  Improvement:  If  a  particular  solution  is  not  optimal, 
this  strategy  defines  a  way  to  improve  the  cost  function  and  is  frequently 
associated  with  variables  that  violate  optimality  conditions.  This  strat¬ 
egy  will  be  stated  in  section  3.2. 

Using  the  results  of  sections  3.1  and  3.2  we  will  then  formulate  our  decompo¬ 
sition  algorithm  in  section  3.3. 


3.1  Optimality  Conditions 

The QP problem that we have to solve in order to train an SVM is the following
[1, 2, 7]:

$$
\begin{array}{lll}
\text{Minimize}_{\Lambda} & W(\Lambda) = -\Lambda^T \mathbf{1} + \frac{1}{2} \Lambda^T D \Lambda & \\
\text{subject to} & \Lambda^T y = 0 & (\mu) \\
 & \Lambda - C\mathbf{1} \le 0 & (\Upsilon) \\
 & -\Lambda \le 0 & (\Pi)
\end{array}
\qquad (1)
$$

where $(\mathbf{1})_i = 1$, $D_{ij} = y_i y_j K(x_i, x_j)$, and $\mu$, $\Upsilon^T = (\upsilon_1, \ldots, \upsilon_\ell)$ and
$\Pi^T = (\pi_1, \ldots, \pi_\ell)$ are the associated Kuhn-Tucker multipliers. The choice of the
kernel $K$ is left to the user, and it depends on the decision surfaces one expects
to work best. Since $D$ is a positive semi-definite matrix (the kernel
function $K$ is positive definite), and the constraints in (1) are linear, the
Kuhn-Tucker (KT) conditions are necessary and sufficient for optimality.
The KT conditions are as follows:
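
$$
\begin{array}{rcl}
\nabla W(\Lambda) + \Upsilon - \Pi + \mu y &=& 0 \\
\Upsilon^T (\Lambda - C\mathbf{1}) &=& 0 \\
\Pi^T \Lambda &=& 0 \\
\Upsilon &\ge& 0 \\
\Pi &\ge& 0 \\
\Lambda^T y &=& 0 \\
\Lambda - C\mathbf{1} &\le& 0 \\
-\Lambda &\le& 0
\end{array}
\qquad (2)
$$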


In order to derive further algebraic expressions from the optimality conditions
(2), we assume the existence of some $\lambda_i$ such that $0 < \lambda_i < C$, and consider
the three possible values that each component of $\Lambda$ can have:

1. Case: $0 < \lambda_i < C$

From the first three equations of the KT conditions we have:

$$(D\Lambda)_i - 1 + \mu y_i = 0 \qquad (3)$$

Using the results in [2] and [7] one can show that this implies that $\mu = b$.

2. Case: $\lambda_i = C$

From the first three equations of the KT conditions we have:

$$(D\Lambda)_i - 1 + \upsilon_i + \mu y_i = 0 \qquad (4)$$

It is useful to define the following quantity:

$$g(x_i) = \sum_{j=1}^{\ell} \lambda_j y_j K(x_i, x_j) + b \qquad (5)$$

Using the fact that $\mu = b$ and requiring the KT multiplier $\upsilon_i$ to be
non-negative, one can show that the following condition should hold:

$$y_i g(x_i) \le 1 \qquad (6)$$

3. Case: $\lambda_i = 0$

From the first three equations of the KT conditions we have:

$$(D\Lambda)_i - 1 - \pi_i + \mu y_i = 0 \qquad (7)$$

By applying an algebraic manipulation similar to the one described for
case 2, we obtain

$$y_i g(x_i) \ge 1 \qquad (8)$$
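
The three cases translate directly into a per-point test of optimality. The sketch below is a minimal illustration, assuming that the values $g(x_i)$ have already been computed; the function name and the numerical tolerance `tol` are assumptions introduced for the example. For case 1 it uses the equality $y_i g(x_i) = 1$, which follows from (3), (5) and $\mu = b$ because $(D\Lambda)_i = y_i (g(x_i) - b)$.

```python
def violates_optimality(lam, y, g, C, tol=1e-6):
    """Return True if a point violates the condition for its case.

    lam : coefficient lambda_i of the point
    y   : label of the point, +1 or -1
    g   : value g(x_i) of the current decision function at the point
    C   : upper bound on the coefficients
    tol : numerical tolerance (an assumption of this sketch)
    """
    margin = y * g
    if lam <= tol:              # case 3: lambda_i = 0 requires y_i g(x_i) >= 1
        return margin < 1.0 - tol
    if lam >= C - tol:          # case 2: lambda_i = C requires y_i g(x_i) <= 1
        return margin > 1.0 + tol
    return abs(margin - 1.0) > tol   # case 1: requires y_i g(x_i) = 1

# Hypothetical values for a few points, with C = 1.
print(violates_optimality(0.0, +1, 1.5, C=1.0))   # lambda = 0, margin 1.5
print(violates_optimality(0.0, +1, 0.5, C=1.0))   # lambda = 0, margin 0.5
print(violates_optimality(1.0, -1, -0.3, C=1.0))  # lambda = C, margin 0.3
print(violates_optimality(0.5, +1, 1.0, C=1.0))   # 0 < lambda < C, margin 1
```

A decomposition method can use such a test to pick out the variables that still violate the optimality conditions.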

3.2  Strategy  for  Improvement 

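The optimality conditions derived in the previous section are essential in order
to devise a decomposition strategy that guarantees that at every iteration the
objective function is improved. In order to accomplish this goal we partition
the index set into two sets $B$ and $N$, where the set $B$ is called the working set.
Then we decompose $\Lambda$ into two vectors $\Lambda_B$ and $\Lambda_N$, keeping $\Lambda_N$ fixed and
allowing changes only in $\Lambda_B$, thus defining the following subproblem:

$$
\begin{array}{ll}
\text{Minimize}_{\Lambda_B} & W(\Lambda_B) = -\Lambda_B^T \mathbf{1} + \frac{1}{2}\left[ \Lambda_B^T D_{BB} \Lambda_B + \Lambda_B^T D_{BN} \Lambda_N + \Lambda_N^T D_{NB} \Lambda_B + \Lambda_N^T D_{NN} \Lambda_N \right] - \Lambda_N^T \mathbf{1} \\
\text{subject to} & \Lambda_B^T y_B + \Lambda_N^T y_N = 0 \\
 & \Lambda_B - C\mathbf{1} \le 0 \\
 & -\Lambda_B \le 0
\end{array}
\qquad (9)
$$

where $(\mathbf{1})_i = 1$, $D_{\alpha\beta}$ is such that $D_{ij} = y_i y_j K(x_i, x_j)$, with $i \in \alpha$, $j \in \beta$, and
$C$ is a positive constant. Using this decomposition we notice that: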

•  The  terms  +  ^AJ^D^nAn  are  constant  within  the  defined  sub¬ 

problem. 

•  Since  K{x,  y)  is  a  symmetric  kernel,  the  computation  of  A^DbnAn  + 
AJjDnbAb  can  be  replaced  by  2A^qBN}  where: 

j€N 

This  is  a  very  important  simplification,  since  it  allows  us  to  keep  the 
size  of  the  subproblem  independent  of  the  number  of  fixed  variables 
An  ,  which  translates  into  keeping  it  also  independent  of  the  number  of 
support  vectors. 

•  We  can  replace  any  Aj,  i  G  B,  with  any  Aj,  j  £  N  (i.e.  there  is  no 
restriction  on  their  value),  without  changing  the  cost  function  or  the 
feasibility  of  both  the  subproblem  and  the  original  problem. 
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•  If  the  subproblem  is  optimal  before  such  a  replacement,  the  new  sub¬ 
problem  is  optimal  if  and  only  if  Xj  satisfies  the  Optimality  Conditions 
for  the  appropriate  case  (3  cases  described  above). 
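The replacement of the two cross terms by $2\Lambda_B^T q_{BN}$ can be checked numerically. This is a small sketch under the assumption $D_{ij} = y_i y_j K(x_i, x_j)$ with a symmetric kernel matrix; the data and function names are made up for illustration.

```python
def q_BN(B, N, lam, y, K):
    """(q_BN)_i = y_i * sum_{j in N} lam_j * y_j * K[i][j], for every i in B."""
    return {i: y[i] * sum(lam[j] * y[j] * K[i][j] for j in N) for i in B}


def cross_terms(B, N, lam, y, K):
    """Lam_B^T D_BN Lam_N + Lam_N^T D_NB Lam_B with D_ij = y_i y_j K[i][j]."""
    def d(i, j):
        return y[i] * y[j] * K[i][j]
    bn = sum(lam[i] * d(i, j) * lam[j] for i in B for j in N)
    nb = sum(lam[j] * d(j, i) * lam[i] for j in N for i in B)
    return bn + nb


# Arbitrary symmetric kernel matrix and multipliers (illustrative values only).
K = [[2.0, 0.5, -1.0, 0.3],
     [0.5, 1.0, 0.2, -0.4],
     [-1.0, 0.2, 3.0, 0.1],
     [0.3, -0.4, 0.1, 1.5]]
y = [1, -1, 1, -1]
lam = [0.2, 0.7, 0.4, 0.1]
B, N = [0, 1], [2, 3]

q = q_BN(B, N, lam, y, K)
shortcut = 2.0 * sum(lam[i] * q[i] for i in B)
print(shortcut, cross_terms(B, N, lam, y, K))
```

Only the |B| entries of `q` are needed by the subproblem, which is why its size does not grow with the number of fixed variables.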

The  previous  statements  lead  to  the  following  more  formal  propositions: 

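Proposition 3.1 (“Build down”): moving a variable from B to N leaves the cost function unchanged, and the solution is feasible in the subproblem.

Proof: Let $B' = B \setminus \{k\}$ and $N' = N \cup \{k\}$. Then:

$W(\Lambda_B, \Lambda_N) = -\sum_{i \in B} \lambda_i - \sum_{i \in N} \lambda_i + \frac{1}{2}\Big[\sum_{i,j \in B} \lambda_i \lambda_j D_{ij} + 2\sum_{i \in B} \lambda_i \sum_{j \in N} \lambda_j D_{ij} + \sum_{i,j \in N} \lambda_i \lambda_j D_{ij}\Big]$

$= -\sum_{i \in B'} \lambda_i - \lambda_k - \sum_{i \in N} \lambda_i + \frac{1}{2}\Big[\sum_{i,j \in B'} \lambda_i \lambda_j D_{ij} + 2\lambda_k \sum_{i \in B'} \lambda_i D_{ik} + 2\lambda_k \sum_{j \in N} \lambda_j D_{jk} + 2\sum_{i \in B'} \lambda_i \sum_{j \in N} \lambda_j D_{ij} + \lambda_k^2 D_{kk} + \sum_{i,j \in N} \lambda_i \lambda_j D_{ij}\Big]$

$= -\sum_{i \in B'} \lambda_i - \sum_{i \in N'} \lambda_i + \frac{1}{2}\Big[\sum_{i,j \in B'} \lambda_i \lambda_j D_{ij} + 2\sum_{i \in B'} \lambda_i \sum_{j \in N'} \lambda_j D_{ij} + \sum_{i,j \in N'} \lambda_i \lambda_j D_{ij}\Big]$

$= W(\Lambda_{B'}, \Lambda_{N'})$

The solution $(\Lambda_{B'}, \Lambda_{N'})$ is feasible in the subproblem since:

$0 = \Lambda_B^T y_B + \Lambda_N^T y_N$

$= \Lambda_{B'}^T y_{B'} + \lambda_k y_k + \Lambda_N^T y_N$

$= \Lambda_{B'}^T y_{B'} + \Lambda_{N'}^T y_{N'}$

and the bound constraints are always unaffected.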
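Proposition 3.1 can be illustrated numerically by evaluating the partitioned cost for two different partitions of the same multipliers. This is a sketch only, assuming $D_{ij} = y_i y_j K(x_i, x_j)$; the data are arbitrary and chosen so that the equality constraint holds.

```python
def W(B, N, lam, y, K):
    """Cost function written in the partitioned form used in the proof."""
    def d(i, j):
        return y[i] * y[j] * K[i][j]
    linear = -sum(lam[i] for i in B) - sum(lam[i] for i in N)
    bb = sum(lam[i] * lam[j] * d(i, j) for i in B for j in B)
    bn = sum(lam[i] * lam[j] * d(i, j) for i in B for j in N)
    nn = sum(lam[i] * lam[j] * d(i, j) for i in N for j in N)
    return linear + 0.5 * (bb + 2.0 * bn + nn)


K = [[2.0, 0.5, -1.0, 0.3],
     [0.5, 1.0, 0.2, -0.4],
     [-1.0, 0.2, 3.0, 0.1],
     [0.3, -0.4, 0.1, 1.5]]
y = [1, -1, 1, -1]
lam = [0.2, 0.7, 0.6, 0.1]           # satisfies sum_i lam_i * y_i = 0

w_before = W([0, 1], [2, 3], lam, y, K)   # B = {0, 1}, N = {2, 3}
w_after = W([0], [1, 2, 3], lam, y, K)    # variable k = 1 moved from B to N
print(w_before, w_after)
```

Moving an index between the two sets only regroups the same terms, so both the cost and the equality constraint are unchanged.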

Proposition 3.2 (“Build up”): moving a variable that violates the optimality conditions from N to B gives a strict improvement in the cost function when the subproblem is re-optimized.

Proof: This is a direct consequence of Proposition 3.1 and the fact that Kuhn-Tucker conditions are necessary and sufficient for optimality.


3.3  The  Decomposition  Algorithm 

Using  the  results  of  the  previous  sections  we  are  now  ready  to  formulate  our 
decomposition  algorithm: 


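1. Arbitrarily choose |B| points from the data set.

2. Solve the subproblem defined by the variables in B.

3. While there exists some $j \in N$, such that:

• $\lambda_j = 0$ and $g(x_j) y_j < 1$

• $\lambda_j = C$ and $g(x_j) y_j > 1$

• $0 < \lambda_j < C$ and $g(x_j) y_j \ne 1$,

replace any $\lambda_i$, $i \in B$, with $\lambda_j$ and solve the new subproblem given by:

Minimize over $\Lambda_B$:

$W(\Lambda_B) = -\Lambda_B^T \mathbf{1} + \frac{1}{2}\Lambda_B^T D_{BB} \Lambda_B + \Lambda_B^T q_{BN}$

subject to

$\Lambda_B^T y_B + \Lambda_N^T y_N = 0$

$\Lambda_B - C\mathbf{1} \le 0$

$-\Lambda_B \le 0$

(11)

where:

$(q_{BN})_i = y_i \sum_{j \in N} \lambda_j y_j K(x_i, x_j), \quad i \in B$ (12)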

Notice that we have omitted the constant term $-\Lambda_N^T \mathbf{1} + \frac{1}{2}\Lambda_N^T D_{NN} \Lambda_N$ in the cost function, and that according to Proposition 3.2, this algorithm will strictly improve the objective function at each iteration and therefore will not cycle. Since the objective function is bounded ($W(\Lambda)$ is convex quadratic and the feasible region is bounded), the algorithm must converge to the global optimal solution in a finite number of iterations.
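The loop above can be sketched in a few lines for the smallest possible working set. The code below is not the MINOS-based implementation described in the next section: it assumes |B| = 2, solves each two-variable subproblem in closed form (an SMO-style pair update), and always picks the most violating pair rather than an arbitrary violator. It also assumes both classes are present and that `K` is a precomputed kernel matrix. All names are illustrative.

```python
def train_decomposition(K, y, C, eps=1e-6, max_iter=10000):
    """Decomposition with a working set of size two, solved analytically."""
    n, tol = len(y), 1e-12
    lam = [0.0] * n

    def most_violating_pair():
        # F_t = sum_s lam_s y_s K[t][s] - y_t, i.e. g(x_t) - y_t without b.
        F = [sum(lam[s] * y[s] * K[t][s] for s in range(n)) - y[t]
             for t in range(n)]
        up = [t for t in range(n)
              if (y[t] == 1 and lam[t] < C - tol) or (y[t] == -1 and lam[t] > tol)]
        low = [t for t in range(n)
               if (y[t] == -1 and lam[t] < C - tol) or (y[t] == 1 and lam[t] > tol)]
        return F, min(up, key=F.__getitem__), max(low, key=F.__getitem__)

    for _ in range(max_iter):
        F, i, j = most_violating_pair()
        if F[j] - F[i] <= eps:            # optimality conditions hold within eps
            break
        # Closed-form solution of the subproblem in (lam_i, lam_j).
        eta = max(K[i][i] + K[j][j] - 2.0 * K[i][j], 1e-12)
        if y[i] != y[j]:
            lo, hi = max(0.0, lam[j] - lam[i]), min(C, C + lam[j] - lam[i])
        else:
            lo, hi = max(0.0, lam[i] + lam[j] - C), min(C, lam[i] + lam[j])
        new_j = min(hi, max(lo, lam[j] + y[j] * (F[i] - F[j]) / eta))
        lam[i] += y[i] * y[j] * (lam[j] - new_j)   # keeps sum_t lam_t y_t = 0
        lam[j] = new_j

    F, i, j = most_violating_pair()
    b = -0.5 * (F[i] + F[j])              # threshold from the two extreme points
    return lam, b


# Toy problem: four points on a line, linear kernel.
x = [-2.0, -1.0, 1.0, 2.0]
y = [-1, -1, 1, 1]
K = [[u * v for v in x] for u in x]
lam, b = train_decomposition(K, y, C=10.0)
print(lam, b)
```

Each pass moves the working set onto a pair that violates the optimality conditions, re-optimizes it, and leaves all other multipliers fixed, which mirrors steps 2 and 3 on a very small scale.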
4 Implementation and Results

We  have  implemented  the  decomposition  algorithm  using  MINOS  5.4  [3]  as 
the  solver  of  the  sub-problems.  We  tested  our  technique  on  a  problem  known 
for  being  “difficult” :  a  foreign  exchange  rate  time  series  that  was  used  in  the 
1992  Santa  Fe  Institute  Time  Series  Competition,  in  which  we  looked  at  the 
sign  of  the  change  of  the  time  series,  rather  than  its  value.  We  considered 
data  sets  of  increasing  sizes,  up  to  110,000  points,  obtaining  up  to  100,000 
support vectors. Figure 1 shows the relationship between training times,
number  of  data  points  and  number  of  support  vectors  in  our  experiments. 
The  training  time  on  a  SUN  Sparc  20  with  128  Mb  of  RAM  ranged  from 
3  hours  for  10,000  support  vectors  to  48  hours  for  40,000  support  vectors. 
The  results  that  we  obtain  are  comparable  to  the  results  reported  in  [8]  using 
a  Neural  Networks  approach,  where  generalization  errors  around  53%  were 
reported. The purpose of this experiment was not to benchmark SVMs on this specific problem, but to show that their use in a problem with as many as 100,000 support vectors is computationally tractable.


Figure 1: (a) Number of support vectors vs. number of data points; (b) Training time vs. number of data points.


5  Summary  and  Conclusions 

In  this  paper  we  have  presented  a  novel  decomposition  algorithm  that  can 
be  used  to  train  Support  Vector  Machines  on  large  data  sets  that  contain 
a  large  number  of  support  vectors.  The  current  version  of  the  algorithm 
has  been  tested  with  a  data  set  of  110,000  data  points  and  100,000  support 
vectors on a machine with 40 Mb of RAM. No attempts to optimize and
speed  up  the  algorithm  have  been  made  yet.  We  believe  that  this  algorithm 
starts  to  meet  the  increasing  need  to  deal  with  data  sets  where  both  the 
number  of  data  points  and  the  number  of  support  vectors  are  of  the  order  of 
$10^5$. Problems with these characteristics are likely to be found in the area of financial markets, where lots of data may be available but little generalization error is expected.
Abstract

Principal  Component  Analysis  (PCA)  and  Linear  Discriminant  Analysis  (LDA)  are 
two  statistical  tools  utilized  in  many  signal  processing  areas  such  as  data  compres¬ 
sion  and  pattern  recognition.  On-line  algorithms  such  as  Oja’s  rule  have  found  wide 
application  for  PCA  and  it  has  been  generalized  in  our  previous  paper  to  LDA 
(Fisher  criterion)  using  the  framework  of  gradient  descent  learning.  In  this  paper, 
the  rule  is  further  extended  to  accept  the  case  of  constraints  in  the  feature  extraction 
stage  as  is  often  necessary  for  real  world  applications.  As  examples,  the  new  rule  is 
applied  to  regularizers  in  the  form  of  both  a  2-dimensional  (2-D)  Gaussian  filter  for 
hand-written  digit  classification  and  determination  of  the  memory  depth  of  the 
Gamma  filter  for  isolated  word  recognition.  Results  show  the  good  behavior  of  the 
learning  rule  and  the  advantage  of  using  regularization  for  improved  generalization. 


1.  Introduction 

There  are  basically  two  different  roles  for  feature  extraction  depending  upon  the 
nature  of  the  signal  processing  problem.  In  signal  representation  PCA  has  been 
shown  to  be  the  best  linear  feature  extractor,  while  LDA  using  the  Fisher  criterion 
is  a  common  choice  in  classification.  In  both  cases,  the  feature  extraction  stage  is 
crucial for the overall performance. Recently, alternate methods based on information theory have been proposed which seek statistically independent features, also known as independent component analysis - ICA ([4]). Although naive compared to ICA, PCA is well established and on-line training methods such as Oja’s rule have proved very robust. More recently, several on-line training algorithms
have  been  independently  proposed  for  LDA,  [1],  [2],  [3],  but  the  state-of-the-art  is 
less  developed.  In  our  previous  paper  [1],  Oja’s  rule  was  generalized  to  LDA  under 
the  framework  of  the  gradient  descent  method,  which  we  called  generalized  Oja’s 
rule  (GOR).  The  experiments  in  [1]  have  shown  the  effectiveness  of  the  proposed 
GOR. 

The  purpose  of  this  paper  is  two-fold:  First,  GOR  will  be  further  extended  to  the 
case where constraints are imposed on the linear feature extractor. This is an important practical problem, because preprocessing can be thought of as a form of constraint and in this way the function of the preprocessor is integrated with the
classifier  for  better  performance.  One  application  is  regularization  or  smoothing 
which  is  needed  to  improve  generalization  and  can  be  formulated  as  constraints  on 
the  model  weights.  Based  on  the  extended  GOR,  we  derive  as  special  cases  the 
training  rule  for  LDA  applied  to  digit  recognition  regularized  with  a  2-D  Gaussian 
filter, and to the training of the memory depth of the Gamma memory for word recognition. The second purpose of the paper is to compare the performance of different classifiers based on PCA, non-regularized and regularized LDA, and a MLP
(multilayer  perceptron)  on  hand-written  digit  data.  The  adaptation  of  the  Gamma 
memory  recursive  parameter  suffers  in  general  from  local  minima  and  in  particular 
from  a  lack  of  an  adequate  criterion  in  temporal  pattern  recognition  settings.  We 
show  that  the  time  scale  parameter  is  much  more  robustly  adjusted  by  the  GOR 
than  by  the  usual  back-propagation  method. 


2.  Generalized  Oja’s  Rule  with  Constraints 


This work addresses the linear network shown in Figure 1 which has an output $y = w^T x$. In [1], we showed that the generalized Oja’s rule updates the weights as:


$\Delta w = [\ldots], \qquad w^T b = 1$ (1)


where $b$ is a vector which serves as the basic measurement for the problem under analysis. For PCA the measurement vector is $w$, the weight vector itself. For LDA the measurement vector is the class mean difference $m$, which depends upon the scatter matrix [1].
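For the PCA special case mentioned above (measurement vector equal to the weight vector), the update reduces to the classic Oja rule, which is commonly written as $\Delta w = \eta\, y\,(x - y\,w)$. The sketch below implements only that classic rule on synthetic two-dimensional data; the step size, the number of epochs and the data are assumptions made for illustration, and the generalized rule of [1] is not reproduced here.

```python
import math


def oja_train(data, eta=0.01, epochs=50, w0=(1.0, 0.0)):
    """Classic Oja rule: w <- w + eta * y * (x - y * w), with y = w^T x."""
    w = list(w0)
    for _ in range(epochs):
        for x in data:
            y_out = sum(wi * xi for wi, xi in zip(w, x))
            w = [wi + eta * y_out * (xi - y_out * wi) for wi, xi in zip(w, x)]
    return w


# Deterministic synthetic data: large variance along (1, 1), small along (1, -1).
data = [(math.sin(k) + 0.1 * math.cos(3 * k),
         math.sin(k) - 0.1 * math.cos(3 * k)) for k in range(200)]

w = oja_train(data)
print(w, math.hypot(w[0], w[1]))
```

With a small step size the weight vector settles close to the unit-norm principal direction of the data, here roughly along (1, 1).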


Figure  1.  Network  for  PCA  or  2  class  LDA 

We  formulate  a  constraint  as  a  functional  dependence  of  the  weights  w  on  a  vector 
of  parameters 


$w(v) = (w_1(v), w_2(v), \ldots, w_n(v))^T$ (2)

where $v \in R^m$ are parameters and usually $m < n$. We assume that $w$ is differentiable with respect to $v$. There are two main reasons to use this formulation. First, we can use the chain rule to compute the sensitivity of the cost with respect to the new parameter set. Secondly, linear filters with kernels can be regarded as constraints on the projection weights.

Let $y = f(v, x) = w(v)^T x$ denote the output of the linear network when the input is $x$. By applying the gradient method, we get (3), which is the extended GOR for the case with constraints:

$\Delta v = [\ldots]$ (3)

$J(w, b) = [\ldots]$

If the constraints can be implemented in a recursive form such as in the Gamma filter [6], the calculation of the partial derivative in (3) can be done by backpropagation, as will be shown in the following.

3.  Regularized  LDA  with  a  2-D  Gaussian  Filter 

Real world problems are very often data bound, i.e. in high dimensional spaces there are not enough data to train the model and to avoid overfitting. One alternative is to use regularization or smoothing. As an example, let’s look at the hand-written digit recognition problem, where each digit is a binary or multi-level 2-D image of size $P \times Q$ (here $24 \times 18$). A quadratic classifier requires the estimation of on the order of 93,528 parameters, which is impractical. Using the analytic approach, at least 10 times as many samples are recommended by practitioners. Iterative techniques are able to solve the problem with less data, but overfitting becomes a problem. LDA using the Fisher criterion also requires the estimation of the covariance matrix, so this reasoning applies to our method.
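The parameter count quoted above can be checked with a line of arithmetic, assuming it refers to the free entries of a symmetric matrix over the $n = P \times Q$ pixels:

```latex
n = 24 \times 18 = 432, \qquad
\frac{n(n+1)}{2} = \frac{432 \cdot 433}{2} = 93{,}528 .
```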

To  prevent  overfitting,  a  smoothing  constraint  is  applied  to  the  projection  vector  w . 
One  possible  solution  is  to  apply  a  linear  combination  of  2-D  Gaussian  kernels  to 
the  input  image  which  is  equivalent  to  a  2-D  linear  lowpass  filter  with  an  impulse 
response  which  is  Gaussian.  The  advantage  of  using  a  Gaussian  filter  is  that  each 
Gaussian  kernel  is  symmetric  and  the  problem  can  be  easily  formulated  in  the 
scheme  of  equation  (3).  Note  that  the  Gaussian  kernel  is  used  as  a  weighting  func¬ 
tion  over  the  input  instead  of  as  a  nonlinear  operator  applied  to  the  input  space  as 
done  in  the  radial  basis  networks  (RBFs). 

In Fig. 2, each pixel in the image is indexed by $(I, J)$ ($I = 1, \ldots, P$; $J = 1, \ldots, Q$) and is ordered to form the data vector $x$. Correspondingly, the elements of the projection vector $w$ can also be indexed by $(I, J)$ and ordered in the same way. Fig. 2 depicts the 2-D Gaussian kernels on a 2-D grid. The center of the top left Gaussian kernel is $(I_0, J_0)$, where $I_0$ and $J_0$ are real values, not necessarily integers.


Figure  2.  Regularized  LDA  Network  with  Gaussian  kernel 


Thus, the center indexed by $(i, j)$ is determined by $((i-1)\,l_I + I_0,\ (j-1)\,l_J + J_0)$, where $l_I$ and $l_J$ are also real values and represent the distance between neighboring Gaussian kernels in the vertical and horizontal direction respectively. So, the output of the network can be expressed as:


(4) 


where $g_{ij}$ represents the Gaussian kernel at index $(i, j)$ and $c_{ij}$ is the corresponding linear combination coefficient. $a$ is a positive real value related to the variance of the Gaussian kernel and can be used to adjust the degree of smoothing.
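The constraint described above, a projection vector built from a coarse grid of Gaussian kernels, can be sketched as follows. Equation (4) is not legible in this copy, so the kernel form $\exp(-a\,d^2)$, with $d$ the distance between a pixel and a kernel center, is an assumption made for illustration, as are the function and argument names.

```python
import math


def gaussian_weights(c, I0, J0, lI, lJ, a, P, Q):
    """Build a P x Q weight image from an m x n grid of Gaussian kernels.

    Kernel (i, j) is centered at ((i-1)*lI + I0, (j-1)*lJ + J0); the kernel
    shape exp(-a * squared distance) is assumed, not taken from the paper.
    """
    m, n = len(c), len(c[0])
    w = [[0.0] * Q for _ in range(P)]
    for I in range(1, P + 1):
        for J in range(1, Q + 1):
            total = 0.0
            for i in range(1, m + 1):
                for j in range(1, n + 1):
                    dI = I0 + (i - 1) * lI - I
                    dJ = J0 + (j - 1) * lJ - J
                    total += c[i - 1][j - 1] * math.exp(-a * (dI * dI + dJ * dJ))
            w[I - 1][J - 1] = total
    return w


# A single kernel centered on pixel (2, 3) of a 4 x 5 image.
w = gaussian_weights([[1.0]], I0=2.0, J0=3.0, lI=1.0, lJ=1.0, a=0.5, P=4, Q=5)
print(w[1][2], w[0][2])
```

The free parameters are then the m x n coefficients plus the grid offsets, spacings and width, which is far fewer than the P x Q weights of an unconstrained projection.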


Let $A * B$ represent the element by element multiplication between two matrices. Let $D(i) = I_0 + (i-1)\,l_I - I$ and $D(j) = J_0 + (j-1)\,l_J - J$ be the vectors ordered in the same way as $x$ and $w$. By applying (3), we can derive the adaptation rule for $c_{ij}$, $I_0$, $l_I$ and $a$ as follows:

[Equations for $\Delta I_0$, $\Delta c$, $\Delta l_I$ and $\Delta a$; only the common factor $y\,(x - y\,b)^T$ is legible in this copy.]


$J_0$, $l_J$ have similar formulas to $I_0$ and $l_I$ respectively. Notice that with this formalism the regularization constant $a$ can be adapted for optimal performance in the training set.


4. Experiment on Hand-written Digit Recognition

The data include 100 hand-written digits “3” and 100 hand-written digits “8” of size $24 \times 18$. We choose digits “3” and “8” because they are similar, differing
mainly  in  their  left  side.  We  select  10  exemplars  of  each  for  training  to  illustrate  an 
extreme  case  of  a  small  data  set.  Solving  LDA  by  the  numerical  method  would  not 
be  possible  because  of  the  scatter  matrix  singularity.  On-line  PCA,  non-regularized 
LDA  and  Gaussian  regularized  LDA  are  applied  to  the  data,  trained  by  Oja’s  rule, 
GOR  and  the  extended  GOR  respectively.  A  multilayer  perceptron  (MLP)  is  also 
applied  to  the  data  for  comparison,  although  all  the  above  methods  are  linear  net¬ 
works.  The  results  are  shown  in  Figures  3,  4,  5,  and  6.  All  the  left  images  represent 
the  normalized  projection  (largest  eigenvector  for  PCA  and  Fisher  direction  for 
LDA)  after  convergence,  with  white  pixels  corresponding  to  positive  values.  All 
the  histograms  at  the  right  reflect  the  distribution  of  digits  “3”  and  “8”  after  the  pro¬ 
jection.  The  smallest  classification  errors  are  also  indicated  in  the  figures. 

From  Fig.  3,  we  can  recognize  both  the  “3”  and  “8”,  which  suggests  that  PCA  pre¬ 
serves  the  important  information  for  representing  both  the  “3”  and  “8”.  However, 
the  two  classes  overlap  in  the  PCA  projection  even  for  the  training  data  set,  which 
simply  means  that  PCA  is  not  the  most  appropriate  feature  extractor  for  classifica¬ 
tion.  Fig.  4  and  Fig  5  show  the  results  for  both  the  conventional  and  Gaussian  regu¬ 
larized  LDA.  Fig.  6  shows  the  result  for  a  MLP  with  5  hidden  and  one  output  nodes 
applied  directly  to  the  data.  We  can  see  that  the  training  data  of  the  two  classes  have 
been  separated  completely  by  all  three  methods,  and  regularized  LDA  obtains  the 
best  generalization  on  the  test  set  for  this  problem.  The  left  images  simply  show 
that  non-regularized  LDA  is  too  noisy  and  overfitted  to  the  training  data,  while  reg¬ 
ularized  LDA  still  sustains  a  degree  of  smoothness  and  preserves  separation. 

Notice  the  asymmetric  and  long  tailed  distributions  created  by  the  MLP  outputs, 
which  are  characteristic  of  a  nonlinear  mapper.  Recall  also  that  the  LDA  networks 
are  linear,  but  the  weights  have  been  determined  to  enhance  separation.  Although 
our  network  is  not  a  RBF  network,  we  believe  that  GOR  could  be  used  advanta¬ 
geously  to  train  the  linear  weights  in  RBFs. 
Figure 6. Result of Multilayer Perceptron (5 hidden nodes). 6 errors for testing
data,  (a)  training  data  distribution;  (b)  testing  data  distribution 

5.  Regularized  LDA  applied  to  Gamma  memory  training 


The  gamma  memory  is  the  building  block  for  both  the  gamma  filter  and  the  gamma 
neural  model  ([6],  [7]).  The  hallmark  of  this  structure  is  a  delay  element  which  is 
recursive  and  can  be  adapted  on-line  with  the  output  error  using  gradient  descent. 
For  signal  representation  this  adaptation  method  is  reasonable,  but  for  classification 
of  temporal  patterns  the  method  is  a  compromise  when  the  classes  require  different 
memory  depths.  In  this  section,  the  recursive  parameter  of  the  gamma  memory  and 
the  weights  of  the  gamma  filter  (Figure  7)  will  be  adapted  using  regularized  LDA. 

This  structure  constitutes  the  first  layer  of  the  focused  gamma  network  [7]. 
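The recursive delay element can be sketched with the gamma memory recursion as it is usually stated in the literature, $x_k(n) = (1-\mu)\,x_k(n-1) + \mu\,x_{k-1}(n-1)$ with $x_0(n)$ the input. That recursion is assumed here rather than quoted from this paper, and the function name is illustrative.

```python
def gamma_memory(x, mu, q):
    """Return taps[k][n] for k = 0..q; tap 0 is the input signal itself."""
    taps = [list(x)] + [[0.0] * len(x) for _ in range(q)]
    for n in range(1, len(x)):
        for k in range(1, q + 1):
            # Leaky integration of the previous tap, delayed by one sample.
            taps[k][n] = (1.0 - mu) * taps[k][n - 1] + mu * taps[k - 1][n - 1]
    return taps


impulse = [1.0, 0.0, 0.0, 0.0]
smooth = gamma_memory(impulse, mu=0.5, q=2)   # dispersive, longer memory
delay = gamma_memory(impulse, mu=1.0, q=2)    # reduces to a tap delay line
print(smooth[1], smooth[2])
print(delay[1], delay[2])
```

For $\mu = 1$ the structure is an ordinary tap delay line; smaller values trade temporal resolution for memory depth, which is the trade-off the adaptation of $\mu$ has to resolve.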


Figure  7.  Gamma  Memory  +  Linear  Classifier  (Gamma  Filter). 


As  shown  in  Figure  7,  data  are  first  projected  onto  the  gamma  memory  and  then  a 
linear  projection  is  applied.  The  model  can  be  formalized  by  (5) 


$y = f(x) = w^T G^T x = (Gw)^T x, \qquad G = (g_1, \ldots, g_q)$ (5)


where $q$ is the number of taps in the Gamma memory, $g_i \in R^n$ is a column vector which is the order-reversed Gamma kernel (impulse response) at tap $i$ truncated to length $n$. $G$ is a function of the Gamma parameter $\mu$. By applying the generalized Oja’s rule, the adaptation rule for the model can be obtained as follows:
(6)

Normalization: scaling $w$ so that $f(b) = w^T G^T b = 1$

where $G^T x$ and $G^T b$ are the projections of input $x$ and base vector $b$ onto the gamma memory respectively. Since the gamma memory is a recursive structure, $\partial f(x_i)/\partial\mu$ and $\partial f(b)/\partial\mu$ can be calculated by backpropagation through time (BPTT). As (5) and Fig. 7 show, the model can be regarded as a filter. So, $Gw$ is the truncated order-reversed impulse response with length $n$ and can be generated by activating the model with an impulse. The normalization can be done at the same time by adjusting the weights $w_1, \ldots, w_q$.

It  is  not  difficult  to  extend  the  result  of  (6)  to  the  case  of  independent  multi-channel 
Gamma  memories  for  each  channel  as  required  for  speech  recognition.  If  we  regard 
the  whole  projection  as  the  concatenation  of  the  projection  of  each  channel,  then 
the  base  vector  for  PCA  in  this  case  is  the  concatenation  of  the  truncated  order- 
reversed  impulse  response  for  each  channel.  For  the  two-class  Fisher  LDA,  the  base 
vector  is  the  class  mean  difference  which  is  easy  to  understand  and  obtain. 


6.  Experiment  on  Isolated  Word  Recognition 

The data are two classes of isolated spoken words, “wash” and “cash”, from the TIMIT database. There are only 7 examples for each word in the database. For this problem, we are solely interested in finding the optimum value of the gamma parameter $\mu$ that best discriminates between the two words. The use of long tap delay lines is discouraged due to the large networks that they produce, so the gamma memory is appealing and has produced good results [8]. Common sense tells us that the major difference between “wash” and “cash” lies in the beginning of the words. Different values of the gamma parameter will let the model focus on different regions of the input pattern. Therefore, finding an optimum $\mu$ such that the model concentrates on the most discriminant part of the input data will improve performance. Since backpropagation with output MSE (mean squared error) can also be used to adjust the gamma parameter in the gamma filter, this method will be compared with the GOR training.

A  16-channel  constant  Q  filter  bank  (Mel  scale)  is  used  as  the  pre-processing  of  the 
speech signal. For a clear illustration, only the results of 1-channel data are presented (15th channel). The top panel of Figure 8 was obtained with the optimal weights and by stepping $\mu$ from 0 to 2. This panel measures discrimination (Fisher criterion) as a function of $\mu$ and clearly shows that the ability to discriminate between the two words depends heavily upon the value selected (from 0.3 to 0.5; values of $\mu$ larger than 1 are expected to produce low discrimination, since the filter becomes a highpass filter). We can see that the optimal $\mu$ is around 0.4, which is consistent with the observation that long delays are needed. Longer delays will compromise resolution too much due to the lowpass nature of the kernel, and discrimination drops. When the backpropagation method is utilized (second panel) the value of $\mu$ obtained is far from the optimum. This attests to the difficulty of adapting $\mu$ for pattern recognition applications due to the conflicting requirements of finding the best scale to represent more than one class. The generalized Oja’s rule (third panel) converges to the optimal value of $\mu$ within 100 iterations.

7.  Conclusion 

This  paper  shows  that  the  generalized  Oja’s  rule  can  be  extended  even  when  the 
weights  are  a  function  of  hidden  variables  as  may  be  necessary  to  solve  practical 
problems.  LDA  is  a  linear  procedure  that  exploits  the  difference  among  classes 
contained  in  both  the  mean  and  the  covariance  (or  scatter)  matrix.  So  it  is  sensitive 
to  poor  estimations  of  parameters  which  are  most  often  caused  by  insufficient  data. 
Regularization  prevents  the  model  from  overfitting  even  when  the  data  yields  rank 
deficient  covariance  matrices.  This  is  one  obvious  advantage  of  our  on-line  LDA 
over  the  conventional  (numeric)  LDA  implementation  where  a  matrix  inverse  is 
required.  The  idea  of  optimally  smoothing  the  data  before  LDA  can  be  effectively 
incorporated  in  the  extended  GOR  learning  rule  as  was  demonstrated  here.  Note 
that  the  parameters  of  the  Gaussian  smoother  are  adapted  during  operation,  so  the 
method  extracts  as  much  information  as  possible  from  the  training  data.  Although 
the  examples  are  all  two-class  problems,  GOR  can  be  extended  to  multiple  classes 
as  we  will  report  in  a  following  paper. 

The  results  for  the  adaptation  of  the  recursive  parameter  in  the  gamma  memory  are 
also important since they present for the first time a technique to adapt the $\mu$ parameter for temporal pattern recognition applications where the method of minimizing
the  output  mean  square  error  is  not  the  most  appropriate. 

Acknowledgments:  This  work  was  partially  supported  by  NSF  grant  ECS- 
9510715. 
Figure 8. Adaptation of $\mu$. First panel shows that the best value to discriminate is $\mu = 0.4$. Second panel shows that MSE drives $\mu$ towards 0.9. Third panel shows that GOR finds the right value of $\mu$ in around 100 iterations.
Abstract

Beside  the  use  of  purely  neural  systems,  the  combina¬ 
tion  of  preprocessing  units  and  neural  classifiers  has  been 
used  for  a  variety  of  signal  segmentation  and  classification 
tasks.  Whereas  this  approach  reduces  the  input  dimen¬ 
sionality  as  well  as  the  complexity  of  the  classification 
problem,  its  performance  crucially  depends  on  a  proper 
preprocessing  scheme,  i.e.,  feature  extraction.  In  this 
contribution,  adaptive  preprocessing  units  (frequency- 
selective  quadrature  filters)  are  proposed  that  can  be  ad¬ 
justed  in  order  to  provide  optimal  features.  The  mean 
frequencies of the filters are tuned to minimize the classification error. Both FIR- and IIR-based filters are
introduced  and  compared  with  respect  to  their  conver¬ 
gence  properties  and  the  classification  results.  Results 
for  the  solution  of  an  EEG  segmentation  task  using  the 
combined  system  are  given. 


1  Introduction 

A  common  problem  in  biosignal  processing  is  the  classification  (i.e.,  label¬ 
ing  of  samples)  and  segmentation  (i.e.,  discrimination  between  samples)  of 
signals.  From  a  more  general  viewpoint,  this  problem  states  a  supervised 
learning  task  with  the  ordering  of  the  learning  samples  given  by  time.  Due  to 
their well-known universal approximation capability, Neural Networks (NNs)

*This work was supported by the German Ministry for Education and Research (project “Clinical Oriented Neurosciences”, 01 ZZ 9602) and the Thuringian Ministry for Science, Research and Arts (project ITHBRA, B511-95004).
seem to be a suitable approach for a solution. Whereas a variety of recurrent
NN  models  have  been  proposed,  particularly  for  prediction  tasks,  we  focus  in 
the  sequel  on  feedforward  networks  mainly  for  three  reasons: 

1.  Efficient,  robust  (with  respect  to  the  initial  conditions)  and  easily  im- 
plementable  training  algorithms  that  do  not  have  to  consider  stability 
restrictions  are  available  for  feedforward  NNs. 

2.  Recurrency  is  known  to  “blur”  segment  borders.  Whereas  in  the  linear 
case (IIR filter) this influence is relatively easy to estimate, this is
generally  not  possible  for  nonlinear  systems. 

3.  A  consistent  theory  for  the  generalization  of  recurrent  networks  has  not 
yet  been  developed.  However,  there  is  some  evidence  that  at  least  par¬ 
tially  recurrent  networks  might  lack  intrinsic  generalization  capability, 
e.g.  with  respect  to  sampling  rate  changes  [8]. 

Using  feedforward  networks,  the  relevant  part  of  the  signal’s  past  has  to  be 
fed  explicitly  into  the  network.  Back  and  Tsoi  [2]  proposed  the  use  of  FIR- 
synapses  that  correspond  to  the  parallel  presentation  of  delayed  samples. 
One  can  avoid  the  significant  increase  of  the  input  dimensionality  connected 
herewith,  if  a  suited  preprocessing  of  the  signal  is  feasible  that  extracts  the 
information  for  the  classification  task.  A  number  of  authors  have  adopted  this 
combination  of  signal  processing  units  with  neural  classification  e.g.  [7],  [5].  If, 
however  the  parameters  of  the  preprocessing  system  are  not  controlled  by  the 
classification  error,  the  performance  of  the  combined  system  strongly  depends 
on  the  a  priori  available  knowledge,  thus  questioning  the  very  advantage  of 
the  neural  approach  as  a  modelfree  method. 

In  the  following,  we  introduce  LTI  filters  with  adaptable  mean  frequencies 
that  constitute  the  basis  of  a  more  general  class  of  nonlinear  adaptive  signal 
processing  units  (ASPUs).  An  algorithm  for  the  adaptation  of  these  systems 
will be given in section 3. In section 4 the proposed approach is applied to the
segmentation  of  discontinuous  EEG. 


2  Adaptive  Signal  Processing  Units 

A signal processing unit provides a mapping of the time series $\{x(n)\}_{n=1,2,\ldots}$ onto a set of $I$ features

$z_i(n) = z_i(x(n), x(n-1), \ldots), \quad i = 1 \ldots I$

which are fed into a neural classifier with the input-output relation $o(n) = o(z(n), W)$ (with $W$ denoting the weights). In order to integrate the determination of the optimal parameter vector $p^*$ that controls the calculation of the features into the NN training, the gradient $\nabla_p z(n)$ has to be computed.
2.1 FIR bandpass filter with adaptable mean frequency


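The desired frequency response

$H(\omega) = \begin{cases} 1 & \omega_m - \Delta\omega/2 < |\omega| \le \omega_m + \Delta\omega/2 \\ 0 & \text{otherwise} \end{cases}$

can be approximated using the FIR coefficients

$h_k = \xi_k\, \frac{2}{\pi\left(k - \frac{M-1}{2}\right)} \cos\!\left(\omega_m\left(k - \tfrac{M-1}{2}\right)\right) \sin\!\left(\tfrac{\Delta\omega}{2}\left(k - \tfrac{M-1}{2}\right)\right), \qquad z(n) = \sum_{k=0}^{M-1} h_k\, x(n-k)$

($k = 0 \ldots M-1$) where the $\xi_k$ correspond to a suited window function. The computation of the gradient is equivalent to a filtering of the input time series with the coefficients

$h'_k = -\xi_k\, \frac{2}{\pi} \sin\!\left(\omega_m\left(k - \tfrac{M-1}{2}\right)\right) \sin\!\left(\tfrac{\Delta\omega}{2}\left(k - \tfrac{M-1}{2}\right)\right), \qquad \frac{\partial z(n)}{\partial \omega_m} = \sum_{k=0}^{M-1} h'_k\, x(n-k)$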


As $h$ and $h'$ are symmetric, both the bandpass and the gradient filter realize a linear phase. Despite its computational ease, the FIR filter suffers from several drawbacks:

1. The minimal bandwidth depends on the filter degree. $\Delta\omega >$

2. Due to its rippled amplitude gain, the adaptation of the mean frequency is prone to getting trapped in local minima. (For harmonic input signals it can be shown that for arbitrary error functions $e(n) = e(y(n), o(z(n), W))$ the gradient

$$E\left[\frac{\partial e(n)}{\partial \omega_m}\right] = E\left[\frac{\partial e(n)}{\partial z(n)} \frac{\partial z(n)}{\partial \omega_m}\right]$$

has at least as many zeros as there are sidelobes in the frequency response.)
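
The windowed bandpass coefficients and the corresponding gradient filter of section 2.1 can be sketched as follows. This is only an illustration: the Hamming window and the function name `fir_bandpass` are assumptions, since the text merely requires a suitable window function.

```python
import math

def fir_bandpass(omega_m, delta_omega, M):
    """Return (h, h_grad): windowed bandpass coefficients and their
    derivative with respect to the mean frequency omega_m."""
    centre = (M - 1) / 2.0
    h, h_grad = [], []
    for k in range(M):
        tau = k - centre
        # Hamming window (an assumed choice of window function)
        w = 0.54 - 0.46 * math.cos(2.0 * math.pi * k / (M - 1))
        if tau == 0.0:
            band = delta_omega / math.pi  # limit of the sinc-like term for tau -> 0
        else:
            band = 2.0 / (math.pi * tau) * math.sin(0.5 * delta_omega * tau)
        h.append(w * band * math.cos(omega_m * tau))
        # d/d(omega_m) of cos(omega_m * tau) is -tau * sin(omega_m * tau);
        # the factor tau cancels the 1/tau of the bandpass term
        h_grad.append(-w * (2.0 / math.pi) * math.sin(omega_m * tau)
                      * math.sin(0.5 * delta_omega * tau))
    return h, h_grad

def fir_filter(x, coeffs):
    """z(n) = sum_k coeffs[k] * x(n - k), with x(n) = 0 for n < 0."""
    return [sum(c * x[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
            for n in range(len(x))]
```

Filtering the input with `h_grad` instead of `h` yields the derivative of the filter output with respect to the mean frequency, which is the quantity backpropagation needs from the preprocessing unit.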

As particularly the last point seems to be a severe limitation for an adaptive
preprocessing system, we introduce the adaptable IIR resonance filter.


2.2 IIR resonance filter with adaptable mean frequency

An IIR system with the poles $r e^{\pm j\omega_0}$ realizes the frequency response

$$H(\omega) = \frac{b_0}{\left(1 - r e^{-j(\omega - \omega_0)}\right)\left(1 - r e^{-j(\omega + \omega_0)}\right)}. \qquad (1)$$

As the poles are complex conjugates, this system is realizable:

$$z(n) + a_1 z(n-1) + a_2 z(n-2) = b_0 x(n) \qquad (2)$$

$$a_1 = -2r\cos(\omega_0)$$

$$a_2 = r^2$$

$$b_0 = (1 - r)\sqrt{1 + r^2 - 2r\cos(2\omega_0)}$$

For poles near the unit circle ($r$ clearly has to be smaller than 1 to ensure stability) the mean (or resonance) frequency is approximately equal to $\omega_0$. The amplitude gain is unimodal and the 3 dB bandwidth does not primarily depend on the order:

$$\Delta\omega_{3\mathrm{dB}} \approx \frac{2(1-r)}{\sqrt{r}} \quad \text{for 2nd order systems}$$

(although an increase of order corresponds to a concatenation of systems of 2nd order and thus naturally decreases the bandwidth). For the off-line adaptation the computation of the gradient is simple:

$$\frac{\partial z(n)}{\partial \omega_0} = \frac{\partial b_0}{\partial \omega_0} x(n) - \sum_{i=1}^{2}\left(\frac{\partial a_i}{\partial \omega_0} z(n-i) + a_i \frac{\partial z(n-i)}{\partial \omega_0}\right)$$

$$= \frac{2r(1-r)\sin(2\omega_0)}{\sqrt{1 + r^2 - 2r\cos(2\omega_0)}}\, x(n) - 2r\sin(\omega_0)\, z(n-1) + 2r\cos(\omega_0)\frac{\partial z(n-1)}{\partial \omega_0} - r^2 \frac{\partial z(n-2)}{\partial \omega_0} \qquad (3)$$


Eq. (3) can also be used for the on-line adaptation, if the changes of the
mean frequency are small during any time interval $(t, t + \tau)$, where $\tau$ denotes
the group delay around the mean frequency

$$\tau \approx \frac{2r^2}{1 - r^2}, \qquad r > 0.8.$$


Note  that  as  their  recursive  coefficients  are  identical,  the  stability  of  the 
resonance  filter  implies  the  stability  of  the  gradient  filter.  An  extension  of 
(2)  and  (3)  onto  higher  order  filters  is  straightforward. 
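
The recursions (2) and (3) can be run side by side, as in the following minimal sketch. It assumes that $b_0$ normalizes the gain at the resonance frequency to one, i.e. $b_0 = (1-r)\sqrt{1 + r^2 - 2r\cos(2\omega_0)}$; the function name `resonator_with_gradient` is hypothetical.

```python
import math

def resonator_with_gradient(x, omega0, r):
    """Second-order resonance filter, Eq. (2), together with the gradient
    filter, Eq. (3). Returns (z, dz_domega0) as two lists."""
    a1 = -2.0 * r * math.cos(omega0)
    a2 = r * r
    root = math.sqrt(1.0 + r * r - 2.0 * r * math.cos(2.0 * omega0))
    b0 = (1.0 - r) * root                       # assumed unit gain at omega0
    db0 = 2.0 * r * (1.0 - r) * math.sin(2.0 * omega0) / root
    da1 = 2.0 * r * math.sin(omega0)            # a2 does not depend on omega0
    z1 = z2 = g1 = g2 = 0.0                     # delayed outputs and gradients
    z, grad = [], []
    for xn in x:
        zn = b0 * xn - a1 * z1 - a2 * z2
        # same recursive coefficients as the filter itself, hence same stability
        gn = db0 * xn - da1 * z1 - a1 * g1 - a2 * g2
        z.append(zn)
        grad.append(gn)
        z2, z1 = z1, zn
        g2, g1 = g1, gn
    return z, grad
```

Because the gradient recursion reuses $a_1$ and $a_2$, no additional stability check is needed, which matches the remark above.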

The IIR filter seems to be superior for adaptation as it eliminates both
mentioned problems of the FIR filter. It does however have a nonlinear
phase response resulting in a frequency-dependent group delay. This effect can
be reduced using a suitable allpass filter [3].


2.3 Frequency-selective Quadrature Filters with adaptable mean frequency

Frequency-selective Quadrature Filters (FSQFs) can be constructed as a
combination of the bandpass filters described above and generic broad-band QFs,
which can be approximated using FIR filters of order $M_{QF}$ with coefficients $q$
[6]. Thus one gets

$$z(n) = \sum_{k=0}^{M_{QF}} q_k\, z'(n-k)$$

$$\frac{\partial z(n)}{\partial \omega_m} = \sum_{k=0}^{M_{QF}} q_k \frac{\partial z'(n-k)}{\partial \omega_m}$$

where $z'$ denotes the output of either an IIR or FIR bandpass filter.

These  filters  can  be  used  to  detect  the  envelope  of  an  amplitude-modulated 
signal  given  a  mixture  of  superposed  components  or  noise.  The  adaptation 
task  could  for  example  be  to  adapt  the  mean  frequency  of  the  FSQF  to 
the (unknown) frequency of the carrier of an AM signal, assuming that the
modulating  function  is  known. 

3 Extended Backpropagation for Training ASPUs and NNs


Figure 1: Algorithm for controlling the step length (parameter $s$), $\alpha < 1$, $\beta > 1$

The systems proposed in the previous section enable us to compute the
gradient of the output error (usually the squared difference of the target and
the NN’s output) with respect to the network weights (standard backpropagation)
and the mean frequencies. However, as the values for $\omega_0$ are
restricted to the interval $(0, \pi)$, the usual gradient descent technique cannot
be applied. For the deterministic off-line adaptation a variety of methods for
the solution of constrained optimization problems can be used. Since most of
them transform the task into a series of unconstrained problems, they cannot
be extended to the on-line adaptation. A simple, yet effective method
that can be applied to both approaches is to control the step length in order
to guarantee that none of the new parameters leaves the admissible range
(fig. 1).
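
One possible reading of such a step-length control is sketched below. The exact algorithm of Figure 1 is not recoverable from the text, so the rule used here (shrink the step by $\alpha$ until every parameter stays inside the admissible range, then enlarge it by $\beta$ for the next iteration) is an assumption, as are the function name and default values.

```python
def constrained_step(params, grads, s, lo, hi, alpha=0.5, beta=1.1, max_shrinks=50):
    """Gradient step that keeps every parameter strictly inside (lo, hi).

    Returns (new_params, next_step_length). alpha < 1 shrinks the step
    after a violation, beta > 1 enlarges it after an accepted step.
    """
    for _ in range(max_shrinks):
        trial = [p - s * g for p, g in zip(params, grads)]
        if all(lo < t < hi for t in trial):
            return trial, s * beta      # accepted: allow a longer step next time
        s *= alpha                      # rejected: retry with a shorter step
    return list(params), s              # give up and leave the parameters unchanged
```

For the mean frequencies one would call it with `lo=0.0` and `hi=math.pi`, so that no $\omega_0$ can leave the interval $(0, \pi)$.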

4  Segmentation  of  discontinuous  EEG  using 
Neural  Networks 

4.1  Data  Material 

The discontinuous EEG of newborns ($\approx$ 27th to 32nd week of conceptional
age) is dominated by the alternating occurrence of burst and interburst patterns.
In [1] it is pointed out that a pattern-selective analysis is needed for an
efficient quantification of the functional development of the brain in healthy
newborns as well as of the severity of disturbances of newborns at risk. For the
separation of burst and interburst it is essential to detect the “initial wave”
at the beginning of the burst with an interindividually different frequency
of about 2.8 ... 12.8 Hz and to track the significantly increased broad band
power in the frequency range 0.8 ... 16 Hz.

For the training as well as for the classification, records of frontal electrodes
($Fp_1$, $Fp_2$) of healthy newborns between the 27th and the 31st week
(conceptional age) sampled at 128 Hz have been used. The records have been
segmented into bursts and interbursts by a medical expert.


4.2  Preprocessing  on  the  basis  of  FIR  filters 

The  underlying  assumption  for  the  use  of  FSQFs  as  a  preprocessing  method 
is  that  the  occurrence  of  a  burst  is  characterized  by  the  consecutive  emerging 
of  power  maxima  in  different  frequency  bands.  This  assumption  has  been 
verified  by  different  studies  [9],  [1],  [4].  Thus  a  burst  can  in  principle  be 
detected  using  a  combination  of  amplitude  demodulators  working  in  different 
frequency  ranges. 

Due  to  the  application  of  adaptable  FSQFs,  the  frequency  ranges  can 
be  chosen  in  order  to  optimize  the  classification  results.  Table  1  shows  the 
initial  and  the  adapted  values  of  the  mean  frequencies  of  a  bank  of  FSQFs 
based  on  FIR  filters  (M  =  101)  which  was  coupled  with  a  (4-3-1)  MLP  with 
sigmoidal node functions. Both the first and the second band converge to a
frequency that matches the assumed value of the “initial wave” (fig. 2). The
classification results are shown in fig. 3.

band no.   f_m (initial) [Hz]   Δf (fixed) [Hz]   f_m (after training) [Hz]
1          5.5                  1                 5.8
2          7.9                  1                 5.8
3          10.2                 5                 11.1
4          11.8                 7                 13.0

Table 1: initial values and values after training for the mean frequencies of
the QF-bank (FIR-based)

band no.   f_m (initial) [Hz]   Δf_3dB (fixed) [Hz]   f_m (after training) [Hz]
1          10                   0.2                   14
2          16                   0.2                   5.7
3          22                   0.2                   23

Table 2: initial values and values after training for the mean frequencies of
the QF-bank (IIR-based)


4.3 Preprocessing on the basis of IIR filters

In order to demonstrate its robustness, the start values for the QF-bank
based on IIR filters have been chosen with greater intervals and at a greater
distance from the frequency of the “initial wave” (tab. 2). Figure 2 (B) shows
the convergence of the respective mean frequencies. The results of the IIR
filter validate the assumption that the frequency range around 5.8 Hz contains
information which is essential for the segmentation of burst and interburst
patterns. Band #3 seems to be less important, as its mean frequency is little
affected by the adaptation procedure. Band #2 is covered by both band #3
and band #4 of the FIR-based system.

Figure 3 shows a segment of the original signal and the corresponding
classification (a (3-3-2) MLP was chosen). The IIR approach performs slightly
better, particularly concerning the detection of the burst onset.


5  Conclusions 

A general scheme for the integration of adaptable preprocessing units into the
training process of a neural classifier has been introduced. In contrast to a
“pure neural” approach this method allows the implementation of
problem-specific knowledge. In contrast to static preprocessing, however, the initial
choice of the preprocessing system need not be perfect as its parameters
can be adapted during a process very similar to network training.

Figure 2: Evolution of the training error (SSE) versus the convergence of the
mean frequencies ($\omega_1, \omega_2, \omega_3$) for the FIR filter (A) and the IIR filter (B)
based FSQFs. The $\omega_i$ are given in parts of the Nyquist frequency, i.e.
$f_i = \omega_i \times 64$ Hz.

Figure 3: Segmentation of discontinuous EEG (electrode $Fp_1$, 128 Hz sampling
rate) using an MLP combined with FIR based (A) and IIR based (B)
preprocessing. The subplots (A):(b)-(e) show the outputs of the adapted
FSQFs constructed with FIR bandpass filters. The lower subplots of both
figures show the target time series and the output of the neural classifier
(bold). (In (B) the target has been scaled for improved visibility.)

The use of both FIR and IIR filters for the construction of frequency-selective
ASPUs has been proposed. Whereas FIR-based systems offer a distortionless
transmission, the adaptation of their frequency parameters is usually
more complicated than for IIR systems. Furthermore, the latter provide
more efficient solutions for the construction of extremely narrow passbands.

For  a  signal  classification  task,  the  efficiency  of  the  new  approach  has 
been  demonstrated.  Systems  of  similar  structure  seem  to  be  applicable  for  a 
number  of  problems  where  either  events  of  similar,  though  interindividually 
different  characteristics  or  with  (slowly)  time-varying  features  have  to  be 
detected.  In  both  cases  a  moderate  learning  effort  could  save  a  significant 
amount of manual evaluation.

Wave propagation within a “cortex” of neurons is introduced as a
neural coupling mechanism. Using this effect for the control of the
neural learning process, the network generates self-organizing feature
maps. Additionally, wave propagation is used to influence the
neural competition in representing the input of the network. By
this means the network is able to represent temporal aspects of the
input. Since apart from a global bus such a network only requires
local interactions, connectivity is very low and a parallel hardware
implementation suggests itself. Operation of a demonstration setup
consisting of 16 neurons in digital technology is demonstrated by
the representation of phoneme sequences.


1  INTRODUCTION 

Wave  propagation  caused  by  reactive  and  diffusive  processes  is  ubiquitous  in 
nature.  Such  waves  can  be  observed  in  several  biological  systems,  e.g.,  among 
cells  of  the  liver,  muscles,  in  ova  (for  a  review  see  [1])  or  within  networks  of 
glial cells in the brain [2]. The latter example is especially interesting, since
up to now it is still unclear whether and, if so, how glial cells participate in
the information processing of the brain [3]. In inanimate nature, too,
there  are  several  examples  of  active  media  showing  wave  propagation,  e.g., 
chemical  and  gas-discharge  systems  or  semiconductor  substrates  (see  e.g.  [4]). 
In  these  cases  diffusion  is  the  spatial  interaction  that  enables  the  propagation 
process. 

Using  an  electrical  network  as  active  medium  we  have  demonstrated  that 
wave  propagation  can  serve  as  neural  coupling  mechanism  in  self-organizing 
topographic  feature  maps  [5].  This  facilitates  an  efficient  hardware  realization 
of  self-organizing  neural  networks.  Apart  from  a  data  bus  this  hardware 
concept  only  needs  local  interactions.  Thus,  connectivity  within  the  network 
is very low. Since intrinsic dynamic effects of active media are used for
information processing, this concept may be called “synergetic” hardware
implementation  [6,  7].  Such  a  neural  hardware  is  very  interesting  in  view  of 
technical  applications,  since  self-organizing  feature  maps  are  used  in  various 
technical  fields  [8]. 

Furthermore,  feature  map  generation  is  interesting  from  the  biological  point 
of  view.  Topographic  feature  maps  can  be  observed  in  various  parts  of  the 
mammalian  cortex.  There  are,  e.g.,  retinotopic,  tonotopic  or  somatotopic 
maps  (see  e.g.  [9]). 


2  KOHONEN’S  ALGORITHM 

We focus on Kohonen’s algorithm, which is the most perspicuous model of
a self-organizing feature map. Each neuron, $N_r$, of the network is located in
a “cortex” at a position $r$. Every neuron corresponds to a reference vector,
$W_r$, in an arbitrary feature space. When the neurons are presented with a
feature vector, $U$, a winner neuron $N_{r_{win}}$ has to be determined. The winner
neuron, which is located at cortical position $r_{win}$, is thought to best represent
the feature vector, $U$:

$$r_{win} = \arg\min_r(\|U - W_r\|). \qquad (1)$$

Such a minimum detection is used not only in Kohonen’s algorithm but also
in several other neural network models. It can be performed in parallel by
means of a suitable active medium [5]: Every neuron feeds a stimulus to
the medium. This stimulus is a monotonically decreasing function of the
feature space distance $\|U - W_r\|$, so that the highest stimulus is produced
by the winner neuron designate. On increasing a global medium parameter,
the highest stimulus causes an “ignition” of the active medium. Detection
of this ignition terminates the parameter increase and determines the winner
neuron.

The essential part of Kohonen’s algorithm is the learning step, which affects the
winner neuron and its cortical neighborhood. Learning is an adaptation, $\Delta W_r$,
of reference vectors, $W_r$, in the direction of the feature vector, $U$. With
increasing cortical distance,

$$d_{win}(r) = \|r_{win} - r\|, \qquad (2)$$

between a neuron at position $r$ and the winner neuron at position $r_{win}$, the
relative amount of reference vector adaptation decreases. This interaction is
described by the neighborhood function, $\eta(d_{win}(r))$:

$$\Delta W_r = \eta(d_{win}(r))(U - W_r). \qquad (3)$$

Feature vectors, $U$, are thought to be stochastically presented while both
height and width of the neighborhood function, $\eta(d_{win}(r))$, are successively
diminished.
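
Equations (1) to (3) translate into a few lines of code. The sketch below assumes a one-dimensional cortex (neuron index equals cortical position) and leaves the neighborhood function to the caller; the function names are illustrative only.

```python
import math

def find_winner(U, W):
    """Eq. (1): index of the reference vector closest to the feature vector U."""
    return min(range(len(W)), key=lambda r: math.dist(U, W[r]))

def kohonen_step(U, W, eta):
    """Eqs. (2) and (3): move every reference vector towards U, weighted by
    the neighborhood function eta of the cortical distance to the winner.
    W is modified in place; the winner index is returned."""
    win = find_winner(U, W)
    for r in range(len(W)):
        d_win = abs(r - win)            # cortical distance on a 1-D chain
        strength = eta(d_win)
        W[r] = [w + strength * (u - w) for u, w in zip(U, W[r])]
    return win
```

A typical call passes a function such as `lambda d: 0.5 if d == 0 else 0.25 if d == 1 else 0.0` for `eta`, and shrinks its height and width between presentations.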




ties, a wave front propagates through the medium starting at time $t=0$ from
the point of ignition, i.e., the position of the winner neuron (Fig. 1). The
simplest (although not very natural) form of such a wave front propagating
with velocity $c$ reads

$$\Phi(r,t) = \mathcal{H}(ct - \|r_{win} - r\|) \qquad (4)$$

with the Heaviside function, $\mathcal{H}(x)=0$ if $x<0$, and $\mathcal{H}(x)=1$, otherwise.
This wave front triggers the learning process of each neuron it reaches:


$$\frac{\partial}{\partial t} W_r(t) = 0 \quad \text{if } \Phi(r,t) < 0.5,$$


Figure 2: The neighborhood function, $\eta(\|r_{win} - r\|)$ (Eq. 6), for different values of
the parameters $d_{max}$ and $\sigma$. The values of $\sigma$ are equivalent to $l=2$, $l=4$, $l=8$ in
the digital version (Eq. 13).


>?(d..„(r))  =  1  -  exp  if  (6) 

The shape of the neighborhood function, $\eta$, is shown in Fig. 2 for different
values of the parameters $d_{max}$ and $\sigma$.


4  REPRESENTATION  OF  TEMPORAL  SEQUENCES 

Different  attempts  have  been  made  to  extend  or  modify  the  self-organizing 
feature  map  in  order  to  represent  temporal  aspects  of  the  presented  features. 
This  is  a  very  important  task  if  the  network  is  supposed  to  process  context 
information,  e.g.  in  speech  processing  systems  [10].  In  general,  such  a  task 
demands  a  representation  of  the  near  past.  This  can  be  achieved  by  means 
of  a  time-delay  architecture,  which  enables  the  system  to  process  a  certain 
amount  of  former  inputs  (for  a  review  see  e.g.  [11]).  One  can  apply  this 
technique  to  the  input  of  a  standard  Kohonen  feature  map:  A  number  of  time- 
delayed  input  vectors  are  concatenated  to  a  larger  vector  that  serves  as  feature 
vector, U. This method was successfully used to improve the recognition of
transient  phonemes  in  a  speech  recognition  system  [10]. 
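
The time-delay construction described above amounts to concatenating the current input with its predecessors. A minimal sketch, with a hypothetical helper name and list-based vectors:

```python
def time_delay_vectors(frames, n_delays):
    """Concatenate each input vector with its n_delays predecessors.

    frames is a list of equally long input vectors ordered in time; the
    result contains one enlarged feature vector per time step that has a
    complete history (newest input first)."""
    stacked = []
    for t in range(n_delays, len(frames)):
        vec = []
        for d in range(n_delays + 1):
            vec.extend(frames[t - d])
        stacked.append(vec)
    return stacked
```

Each enlarged vector can then be presented to a standard Kohonen map as the feature vector U.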

Another idea to represent temporal dynamics is to use an independent, fully
interconnected network, which learns the transitions within the cortex of the
feature map [12]. Such a network is even capable of reproducing a trained
input sequence. A more implicit representation of the past is realized by
providing the neurons of the feature map with a retention and decay of their
activity. If such information on former activity, i.e., being a winner neuron,
influences the present determination of the winner neuron, then the network
becomes able to represent temporal sequences [13].

Figure 3: Three waves $\Phi_i(r,t)$ in the cortex according to Eq. (8) started by three
subsequent winner neurons at cortical positions $r_i^{win}$. Subsequent winner neurons
are forced by the wave to be situated in a certain cortical neighborhood of the
respective preceding winner neuron.

Following an idea of Euliano and Principe [14], wave propagation can be
utilized to process information on the past so that temporal coherence is
represented in the feature map. Propagation and interference of waves that
are attenuated over space and time influence the neural competition:
Determination of the winner neuron is modified with the result that temporal
neighborhood of feature vectors in a sequence may lead to an adjacent
representation of these vectors in the feature map. Euliano and Principe choose
a predetermined direction of wave propagation. We will restrict propagation
in a less rigid and more natural way leading to some advantages.
Additionally, we omit attenuation and interference of waves for the sake of an easier
hardware implementation.

Consider a wave, $\Phi_i(r,t)$, with a concentric wave crest of certain width, $b$,
propagating through the cortex. It starts at time $t=t_i$ at the position of the
winner neuron, $r_i^{win}$. The index, $i$, numbers the sequence position of the
corresponding feature vector, $U_i$, starting with $i=1$. Wave propagation ends
at time $t_{i+1}$, when the next feature vector, $U_{i+1}$, is presented. Using the
abbreviations Eq. (2) and

(7)

such a wave reads

$$\Phi_i(r,t) = \mathcal{H}(ct'_i - d_{win}(r))\,\mathcal{H}(d_{win}(r) - ct'_i + b)\,H(r,t) \qquad (8)$$

where, again, $\mathcal{H}$ is the Heaviside function. The “history function”, $H(r,t)$,
restricts wave propagation with respect to the past. It is initialized at the
beginning of a new sequence with

$$H(r,t_1) = 1, \quad \text{for all } r. \qquad (9)$$

Afterwards,  it  is  set  to  zero  at  a  position,  r,  if  the  crest  of  a  wave  has  left 
that  position,  i.e., 

(10) 

and  remains  zero  for  the  rest  of  the  sequence. 

As a result, a subsequent wave does not propagate into a region that has
been passed by an earlier wave during the same sequence. This resembles the
refractory  phase  of  real  neural  tissue.  Even  more  natural  dynamics  would 
allow  for  a  decay  of  H{r,t)  over  time.  Moreover,  this  could  be  understood 
as  an  explicit  measure  of  time  within  the  network  (compare  [13]). 

Wave propagation according to Eq. (8) and Eq. (10) is sketched in Fig. 3. The
final states of three subsequent waves, i.e., $\Phi_i(r, t_{i+1})$, $i=1 \ldots 3$, are shown. In
the final wave region the probability for a neuron to win the competition
and to become the winner neuron is enhanced. Equation (1) of the Kohonen
algorithm is changed to

$$r_i^{win} = \arg\min_r(\|U_i - W_r\| - \beta\,\Phi_{i-1}(r,t_i)). \qquad (11)$$

Only the first winner ($i=1$) is conventionally determined, i.e., $\Phi_0=0$ in Eq.
(11). In this case Eq. (11) equals the original Kohonen Eq. (1). On the other
hand, if $\beta$ is large, the position, $r_{i+1}^{win}$, of the next winner neuron is forced
to be in the region $\Phi_i(r,t_{i+1})=1$ (Fig. 3), i.e., in a certain neighborhood
of the preceding winner neuron. The later the succeeding feature vector
is presented, the farther away from the predecessor it will be represented.
Since this is a competitive ordering principle, there may arise contradictions
to a purely topographic mapping ($\beta=0$), if very different feature vectors are
neighbors in the sequence or if very similar feature vectors appear far from
each other in the sequence. However, by tuning $\beta$ one can choose what more
attention is paid to: feature similarity or temporal coherence.
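
The effect of the wave term on the winner election can be illustrated with a small sketch on a one-dimensional chain of neurons. The helper names are hypothetical, the history function is simply passed in as a number per neuron, and the crest is evaluated at one fixed travelled distance `ct`.

```python
import math

def heaviside(x):
    """Heaviside function as defined in the text: 0 for x < 0, 1 otherwise."""
    return 0.0 if x < 0 else 1.0

def wave(d_win, ct, b, history=1.0):
    """Eq. (8): crest of width b whose leading edge has travelled the distance ct."""
    return heaviside(ct - d_win) * heaviside(d_win - ct + b) * history

def biased_winner(U, W, phi_prev, beta):
    """Eq. (11): arg min over r of ||U - W_r|| - beta * Phi_{i-1}(r, t_i)."""
    return min(range(len(W)),
               key=lambda r: math.dist(U, W[r]) - beta * phi_prev[r])
```

With `beta = 0` the election reduces to Eq. (1); with a large `beta` the winner is taken from the crest region even if a neuron outside the crest matches the feature vector better.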

Figure 3 shows a parameter set, where the distance, $(t_{i+1} - t_i)c$, being covered
by the wave is distinctly larger than the cortical distance of neighboring
neurons, $\Delta x$. In this situation it is possible to represent a noisy feature
cluster with more than one neuron, i.e., with improved resolution. As a
consequence, the subsequent feature vector will be represented relatively far
away from the predecessor. Within the regions of the single, noisy feature
cluster, the mapping is purely topographic. A disadvantage in such a case is
that the total length of a sequence that can be entirely represented is smaller
than in the case $(t_{i+1} - t_i)c \approx \Delta x$.

If a sequence is too long to be entirely represented, i.e., the wave has
completely left the cortex ($\Phi_i(r,t)=0$ over the whole cortex), Eq. (11), again,
becomes identical with Eq. (1). In this case, the next winner is determined
independently of its predecessor (standard Kohonen). Thus, if the new
winner neuron is not situated in a refractory region, the sequence is
automatically divided into two independent subsequences. Otherwise, no wave is
started and the system determines another winner in the standard Kohonen
manner.

In  the  described  concept  there  are  two  independent  processes  being  governed 
by  wave  propagation:  1)  Influencing  the  winner  election  in  order  to  represent 
temporal  coherence  (Eq.  (11));  2)  Control  of  the  learning  process  of  the 
neurons  (Eq.  (5)).  Both  processes  can  be  governed  by  the  same  wave  since 
learning  can  also  be  triggered  by  a  wave  according  to  Eq.  (8)  instead  of  Eq. 
(4).  This  can  be  done  in  a  way  that  learning  is  not  affected  by  the  history 
function,  Eq.  (10);  i.e.,  refractory  neurons  can  learn  as  usual.  The  only 
restriction  is  that  the  parameter  dmax  in  the  neighborhood  function,  Eq.  (6), 
be  limited  to 

dmax  ^  (^*+1 

5  DIGITAL  HARDWARE  IMPLEMENTATION 


The described concept was developed in view of a digital hardware realization.
In such an implementation Eq. (5) is replaced by its time-discrete analogue,

$$\Delta W_r = \frac{U - W_r}{l} \qquad (13)$$

with a parameter $l$. The wave is replaced by a spatially discrete “domino-effect”.
Eq. (13) is executed by the concerned neurons after each propagation
step of the domino-effect. Owing to this space- and time-discrete procedure
distance measurement in the cortical space is changed from Euclidean to
Manhattan distance. The distance of adjacent neurons is set to $\Delta x = 1$. The
neighborhood function, $\eta(d_{win}(r))$, is the same as given in Eq. (6) with the


transformation 


(14) 


Restricting $l$ to a power of two allows division by $l$ (Eq. 13) to be carried out in a
very simple way. Since the whole concept does not require any multiplication
and because connectivity within the network is very low, the concept is
well-suited for chip integration. As learning is done stepwise (Eq. 13) rather
than continuously (Eq. 5) the possible range of $d_{max}$ is extended over the
limit given in Eq. (12) to


$$d_{max} \le (t_{i+1} - t_i)c + \Delta x.$$


(15)

Figure 4: A single neuron $N_r$ of the digital setup. A feature vector, $U$, is presented
to all neurons via a global bus. The winner neuron is determined by means of
a binary search with a sequence of numbers $D$: if there is a neuron detecting
$D > \|U - W_r\|$, $D$ is diminished, otherwise it is increased by the current power of
two. When the search is over, a winner neuron is found and a new front is started
by this neuron. If the neuron was passed by the former front not longer than $b$
clock cycles before, it belongs to the crest region and increases the numbers $D$ by $\beta$.
Thus, a winner election corresponding to Eq. (11) results. After a winner neuron
is found, each neuron detecting that the front has passed executes the learning
rule (13) with every clock cycle. The front is realized by clocked OR-units being
interconnected with the next cortex-neighbors: The output is connected to the
inputs of the OR-units of the neighboring neurons and vice versa. After $n$ clock
cycles the OR-units are reset and a new front is started by the presentation of the
subsequent feature vector.
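
The two operations a digital neuron performs, taking part in the binary search for the winner and executing the learning rule (13), can be sketched in integer arithmetic. This is an illustration under assumptions, not the firmware of the setup: the search below builds the minimum distance bit by bit from global yes/no answers, the crest bonus is clamped at zero, and all function names are hypothetical.

```python
def binary_search_min(distances, bits=8):
    """Smallest distance found with one global yes/no query per bit.

    Each query asks whether any neuron has a distance below the trial
    value; on a bus this would be the OR of the neurons' comparators."""
    D = 0
    for bit in range(bits - 1, -1, -1):
        trial = D + (1 << bit)
        if not any(d < trial for d in distances):
            D = trial               # nobody is below the trial value: keep this bit
    return D

def elect_winner(distances, in_crest, beta, bits=8):
    """Winner election in the spirit of Eq. (11): neurons inside the crest
    region get their distance reduced by beta (clamped at zero here)."""
    biased = [max(0, d - beta) if crest else d
              for d, crest in zip(distances, in_crest)]
    return biased.index(binary_search_min(biased, bits))

def learn(W_r, U, shift):
    """Eq. (13) with l = 2**shift: the division becomes an arithmetic shift."""
    return [w + ((u - w) >> shift) for w, u in zip(W_r, U)]
```

Note that with a plain right shift, positive differences smaller than $l$ produce no update at all, so the reference vector stops slightly short of the feature vector; whether the original setup rounds differently is not stated in the text.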


Operation is demonstrated with phoneme sequences processed by a
demonstration setup which consists of 16 neurons in a linear chain. Schematic
operation of a single neuron is sketched in Fig. 4. Each neuron is
emulated by a PIC16C84 microcontroller. The components of the feature ($U$)
and reference vectors ($W_r$) are stored as 8-bit values. The acoustic signal is
sampled (rate: 14 kHz) and Fourier-transformed (512 points in 36.7 ms) by
a digital signal processor TMS320C26. The resulting spectrum is grouped
into three frequency channels (0.6-1.0 kHz, 1.0-3.5 kHz, 3.5-7.4 kHz). The
energy of each single channel contributes one component to a 3-dimensional
feature vector, $U$, which, additionally, is normalized. Each feature vector
corresponds to one spoken phoneme, independently of its natural length.
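
The feature extraction described above can be mimicked in a few lines. The sketch uses a plain DFT so that it runs without external libraries, and it assumes Euclidean normalization of the feature vector, which the text does not specify; the function name and the default band edges (taken from the text, in Hz) are illustrative.

```python
import cmath
import math

def band_energy_features(frame, fs, bands=((600, 1000), (1000, 3500), (3500, 7400))):
    """Spectral energy per frequency channel, normalized to unit length.

    frame: list of samples, fs: sampling rate in Hz,
    bands: (low, high) channel limits in Hz, low inclusive, high exclusive."""
    N = len(frame)
    power = []
    for k in range(N // 2 + 1):                 # non-negative frequencies only
        s = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        power.append(abs(s) ** 2)
    energies = [sum(p for k, p in enumerate(power) if lo <= k * fs / N < hi)
                for lo, hi in bands]
    norm = math.sqrt(sum(e * e for e in energies)) or 1.0   # avoid division by zero
    return [e / norm for e in energies]
```

For a 512-point frame an FFT would replace the direct sum; the grouping into channels and the normalization stay the same.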

In the 1-dim setup, the wave travels in both possible directions after it has
been started at the position, $r_1^{win}$, by the first winner neuron of a sequence.
Owing to the refractory behaviour of the neurons, the second feature vector
of a sequence selects a direction. The initial variability allows the representation
of two (in a 2-dim network even more) sequences with the same beginning.
This is shown in Fig. 5 for the two 3-phoneme sequences “F OO T” and
“F EE T” after 60 random presentations of each sequence. As a result of
the self-organizing process the neurons are arranged in clusters. Each cluster
represents a single noisy phoneme, as expected in the Kohonen model.
Additionally, adjacent clusters represent subsequent phonemes in order to
represent the temporal order in each sequence. The initial “F” of both
sequences is represented in common in the middle of the network, whereas the
final “T”s of each sequence are represented twice by each end of the network.

Figure 5: Representation of the two phoneme sequences “F OO T” and “F EE T”
by the 16 neurons of the digital demonstration setup. The parameters of the
experiment are $\beta=765$ (3x8bit, i.e. $\beta$ is “large”), ct-=3 and $b=2$ (Eq. 8, compare Fig.
3) and $l=16$ (Eq. 13). The 16 reference vectors (3-dim), $W_r$, are initialized with
random numbers and learning is performed as follows: Ten presentations of each
sequence (random order) with parameter $d_{max}=4$, 20 presentations with $d_{max}=3$,
and 30 with dmax=l^ (Eq. 6, compare Fig. 2). The probability distributions
(conditional probabilities p(neuron|phoneme)) are obtained with 40 presentations of each
sequence.


Abstract

In this paper we propose novel computationally efficient schemas for a
large class of on-line adaptive algorithms with variable self-adaptive learning
rates. The learning rate is adjusted automatically providing relatively fast
convergence at early stages of adaptation while ensuring small final
misadjustment for cases of stationary environments. For non-stationary environments,
the algorithms proposed have good tracking ability and quick adaptation
to new conditions. Their validity and efficiency are illustrated for a
non-stationary blind separation problem.

Keywords: Adaptive on-line learning algorithms, Blind equalization,
Blind separation of sources from instantaneous mixture.


1  INTRODUCTION 

The problem of optimal updating of a learning rate (step size) is a key
problem encountered in a wide class of learning algorithms. Much of the
research work related to this problem is devoted to batch and/or supervised
algorithms. Various techniques like conjugate gradient, quasi-Newton
methods and Kalman filters have been applied. However, relatively little work
has been devoted to this problem for on-line adaptive algorithms [1, 8, 12, 13].

A large class of on-line learning algorithms (both supervised and
unsupervised) used for training neural networks or nonlinear adaptive systems can
be expressed in general form as [2, 8, 12, 13]:

$$\theta(k+1) = \theta(k) - \eta(k)\,g(k) \quad (k = 0, 1, 2, \ldots), \qquad (1)$$

where $k$ is the iteration index, $\theta(k) = [\theta_1(k), \theta_2(k), \ldots, \theta_n(k)]^T$ is the
$n$-dimensional vector of updated unknown parameters, $\eta(k) > 0$ is a learning
rate (step size), and $g(k) = g(\theta(k), x(k), y(k)) = [g_1(k), g_2(k), \ldots, g_n(k)]^T$
is a nonlinear function depending on $\theta(k)$ and $x(k)$, $y(k)$ (input and output
signals respectively).

In system identification problems, for example, the algorithm defined by
Eq. (1) is termed “supervised” due to the availability of an instantaneous
output error or desired output signal. In such cases, $g(k)$ is a function
of the error and can be considered as the instantaneous gradient (which is
a rough approximation of the true gradient) of a loss (also called cost,
error or energy) function $J(\theta)$, i.e.:

$$g(k) = \nabla_\theta J(\theta, k) = \frac{\partial J(\theta, k)}{\partial \theta}. \tag{2}$$

However, in many applications such as blind separation of sources, the
true loss function $J(\theta)$ is not available or its evaluation is too
time consuming. Moreover, if $g(k)$ is not a function of the error, the
algorithm is termed “unsupervised”.

It is well known that the final misadjustment (often defined in terms of
the mean square error, MSE) increases as the learning rate $\eta$
increases. However, the convergence time increases as the learning rate
decreases [1]. For this reason it is often assumed that the learning rate
$\eta$ is a very small positive constant, either fixed or exponentially
decreasing to zero as time goes to infinity. Such an approach leads to
relatively slow convergence and/or low performance and is not suitable
for non-stationary environments.

This inherent limitation of on-line adaptive algorithms represented by (1)
imposes a compromise between two opposing fundamental requirements, small
misadjustment and fast convergence, demanded in most applications,
especially in non-stationary environments [1, 12]. As a result, it is
desirable to find an alternative method to improve the performance of such
algorithms. Most previous work has been devoted to variable step size
LMS-type (supervised delta rule) algorithms [11, 15]. In this paper we
consider a more general case which includes unsupervised and/or supervised
on-line algorithms.

2  SELF-ADAPTIVE  VARIABLE  STEP  SIZE 
ALGORITHMS 

Amari in his fundamental work [1] analyzed the dynamic behavior of
$\theta(k)$ in the neighborhood of the optimal solution $\theta^*$ and
obtained analytical formulas for the convergence speed to $\theta^*$ and
the fluctuation of $\theta(k)$ around $\theta^*$. Moreover, he proposed an
efficient method for a variable learning rate [1]. In this paper we extend
Amari’s idea of learning the learning rate by proposing robust, variable
step-size on-line algorithms. The main objective of this paper is to
develop a simple and efficient algorithm which enables automatic updating
of the learning rate. This is especially suited to non-stationary
environments.

Recently,  on  the  basis  of  Amari’s  works  we  have  developed  a  new  family  of 
on-line  algorithms  with  an  adaptive  learning  rate  suitable  for  non-stationary 
environments  [8,  9]: 

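$$\theta(k+1) = \theta(k) - \eta(k)\, g(k), \tag{3}$$

$$\hat g(k) = (1 - \rho_2)\, \hat g(k-1) + \rho_2\, g(k), \tag{4}$$

$$\eta(k) = (1 - \rho_1)\, \eta(k-1) + \rho_1 \beta\, \psi(\|\hat g(k)\|) \tag{5}$$

or

$$\eta(k) = [1 - \rho_1 \eta(k-1)]\, \eta(k-1) + \rho_1 \beta\, \eta(k-1)\, \psi(\|\hat g(k)\|), \tag{6}$$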

where $0 < \rho_1 < 1$, $0 < \rho_2 < 1$, $\beta > 0$ are fixed
coefficients and $\psi(\|\hat g(k)\|)$ is a nonlinear function defined,
e.g., as
$\psi(\|\hat g(k)\|) = \tanh\left(\frac{1}{n} \sum_{i=1}^{n} \hat g_i(k)^2\right)$,
with $\hat g_i(0) = g_i(0)$. It should be noted that Eq. (6) has been
obtained from Eq. (5) by simply replacing the fixed $\rho_1$ by a
self-adaptive term $\rho_1 \eta(k-1)$. The above algorithms are related
(especially Eq. (6)) to the very recent research works of Murata et al.
and Sompolinsky et al. [12, 13]. It is interesting to note that
Eqs. (4)-(5) describe simple first-order low-pass filters (LPFs) with
cut-off frequencies determined by the parameters $\rho_1$ and $\rho_2$.
The even nonlinear function $\psi$ is introduced to limit the maximum
value of the gradient norm $\|\hat g(k)\|$, and the maximal value of the
gain $\beta$ is constrained to ensure stability of the algorithm [8, 9].

It should be noted that for fixed $\eta$ the system behaves in such a way
that the parameters $\theta_i(k)$ never achieve steady state but fluctuate
around some equilibrium point. In order to reduce such fluctuations we
have employed low-pass filters. Intuitively, the self-adaptive system
described by Eqs. (2)-(5) operates as follows. If the gradient components
$g_i(k)$ have local average (mean) values which differ from zero (these
are extracted by the first LPFs), then the learning rate $\eta(k)$
decreases or increases to a certain value determined by the gain $\beta$
and the norm of the gradient components. However, during the learning
process $|\hat g_i(k)|$ decreases to zero, i.e. after some time $g_i(k)$
starts to fluctuate around zero, and then $\eta(k)$ also decreases to zero
exponentially, as desired. If some rapid change occurs in the system, then
$|\hat g_i(k)|$ suddenly increases and consequently $\eta(k)$ also rapidly
increases from a small value, so that the system is able to adapt quickly
and automatically to new environments.
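
The mechanism described above can be illustrated with a minimal sketch of the recursion (3)-(5). The toy problem (a noisy gradient of a quadratic loss whose minimum jumps at iteration 500), the function names and all parameter values are assumptions chosen for illustration only; they are not taken from the experiments of this paper.

```python
import numpy as np

def adaptive_rate_descent(grad_fn, theta0, steps,
                          rho1=0.05, rho2=0.05, beta=0.5, eta0=0.01):
    """Sketch of Eqs. (3)-(5): the step size is a low-pass filtered,
    bounded function of the filtered gradient."""
    theta = np.asarray(theta0, dtype=float).copy()
    g_hat = grad_fn(theta, 0)          # initial condition g_hat(0) = g(0)
    eta = eta0
    etas = []
    for k in range(steps):
        g = grad_fn(theta, k)
        g_hat = (1.0 - rho2) * g_hat + rho2 * g           # Eq. (4)
        psi = np.tanh(np.mean(g_hat ** 2))                # even, bounded psi
        eta = (1.0 - rho1) * eta + rho1 * beta * psi      # Eq. (5)
        theta = theta - eta * g                           # Eq. (3)
        etas.append(eta)
    return theta, np.array(etas)

# Hypothetical non-stationary toy problem: the optimum jumps at k = 500.
rng = np.random.default_rng(0)

def noisy_grad(theta, k):
    target = np.array([1.0, -2.0]) if k < 500 else np.array([-3.0, 4.0])
    return (theta - target) + 0.05 * rng.standard_normal(2)

theta_final, etas = adaptive_rate_descent(noisy_grad, [0.0, 0.0], 1000)
```

In this sketch the learning rate shrinks while the parameters settle and grows again after the jump, which is the qualitative behavior described in the text. Since the update (5) is a convex combination of the previous rate and $\beta\psi$, the rate never exceeds $\max(\eta(0), \beta)$.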

The learning algorithm (3)-(6) employs a global learning rate, i.e. the
same variable step size $\eta(k)$ for all the weights $\theta_i$. In order
to improve performance we can use local learning rates, i.e. each
parameter $\theta_i$ can have an individual learning rate $\eta_i(k)$ as
follows [8]:

$$\theta_i(k+1) = \theta_i(k) - \eta_i(k)\, g_i(k),$$

$$\eta_i(k) = (1 - \rho_1)\, \eta_i(k-1) + \rho_1 \beta\, \psi(|\hat g_i(k)|)$$

or

$$\eta_i(k) = \eta_i(k-1) + \rho_1 \eta_i(k-1)\, [\beta\, \psi(|\hat g_i(k)|) - \eta_i(k-1)],$$

$$\hat g_i(k) = (1 - \rho_2)\, \hat g_i(k-1) + \rho_2\, g_i(k).$$

We have found that these algorithms work with high efficiency and that
they are easily implemented in VLSI technology. However, they require the
proper selection of three parameters $(\rho_1, \rho_2, \beta)$. Although
the above algorithms are robust with respect to the values of these
parameters, their optimal choice is problem dependent. Moreover, the
algorithm can still be too slow for some applications. In order to
increase the convergence speed and improve overall performance we propose
in this paper a new learning strategy which is based on a modified
conjugate gradient (CG) approach. We would like to emphasize here that,
although CG techniques are very well known and have been applied
successfully to many optimization problems, most of these applications
are related to batch and supervised learning algorithms. To our knowledge
the CG approach has not been deeply investigated for on-line adaptive
algorithms, especially for unsupervised algorithms when cost (loss)
functions are not explicitly available. In other words, the problem is
formulated as follows: on the basis of an available on-line learning rule
it is necessary to develop efficient algorithms (self-adaptive systems)
which provide self-adaptive adjustment of the learning rate and ensure
high convergence speed and small misadjustment error.

In related works other modified conjugate gradient algorithms have been
proposed for adaptive filtering (see e.g. [6]). However, these proposals
are limited to supervised LMS-type adaptive algorithms. In this paper we
propose a more general approach which can be applied to any kind of
on-line algorithm in the form of Eq. (1) without knowledge of the Hessian
and without a line search.


3  A  MODIFIED  CONJUGATE  GRADIENT 
ADAPTIVE  ALGORITHM 


In this section we propose a new scheme which extends our previous
algorithms (2)-(6) [8]. Consider first the standard CG algorithm [10, 6]
shown below, which is suitable only for batch learning problems since the
cost function $J$ and its exact gradient $g$ must be available:

$$\theta(k+1) = \theta(k) + \eta(k)\, d(k), \tag{7}$$

$$d(k) = -g(k) + \alpha(k)\, d(k-1), \tag{8}$$

with

$$\eta(k) = \arg\min_{\eta} J(\theta(k) + \eta\, d(k)) \quad \text{(LS procedure)}, \tag{9}$$

and

$$\alpha(k) = \frac{(g(k) - g(k-1))^T g(k)}{g(k-1)^T g(k-1)} \quad \text{(PR formula)}, \tag{10}$$

or

$$\alpha(k) = \frac{g(k)^T g(k)}{g(k-1)^T g(k-1)} \quad \text{(FR formula)}, \tag{11}$$


where in this case $k$ is an index in parameter space, $d(k)$ is a search
direction, LS means “line search”, PR means “Polak-Ribière” and FR means
“Fletcher-Reeves”. It can be shown that if the cost function is a
quadratic function of $n$ variables, this CG algorithm finds the minimum
in at most $n$ steps. In extensions of CG to non-quadratic functions,
$\alpha(k)$ is reset to zero every $R$ (typically $R = n$) iterations in
order to improve the rate of convergence near the minimum [10].
Consequently, in the case of such a “restart”, Eq. (8) becomes:

$$d(k+1) = -g(k+1). \tag{12}$$
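
As a point of reference, the batch recursion (7)-(11) can be sketched for a quadratic cost $J(\theta) = \frac{1}{2}\theta^T A \theta - b^T \theta$, for which the line search (9) has a closed form. The matrix `A`, the vector `b` and the function name are illustrative assumptions, not quantities used in this paper.

```python
import numpy as np

def cg_quadratic(A, b, theta0, formula="FR"):
    """Batch conjugate gradient for J = 0.5 th^T A th - b^T th,
    with A symmetric positive definite (exact gradient available)."""
    theta = np.asarray(theta0, dtype=float).copy()
    g = A @ theta - b                 # exact gradient of J
    d = -g                            # first direction: steepest descent
    for _ in range(len(b)):           # at most n steps for a quadratic
        if np.linalg.norm(g) < 1e-12:
            break
        eta = -(g @ d) / (d @ A @ d)  # closed-form line search, Eq. (9)
        theta = theta + eta * d       # Eq. (7)
        g_new = A @ theta - b
        if formula == "PR":
            alpha = ((g_new - g) @ g_new) / (g @ g)   # Eq. (10)
        else:
            alpha = (g_new @ g_new) / (g @ g)         # Eq. (11)
        d = -g_new + alpha * d        # Eq. (8)
        g = g_new
    return theta

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
theta_fr = cg_quadratic(A, b, [0.0, 0.0], formula="FR")
theta_pr = cg_quadratic(A, b, [0.0, 0.0], formula="PR")
```

Both formulas reach the minimizer $A^{-1}b$ within $n = 2$ steps here. Note that the sketch needs both the exact gradient and the cost-dependent line search, which is precisely what is unavailable in the on-line setting.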

Unfortunately, the above algorithm cannot be applied directly to learning
rule (1) since we assume that the loss function $J(\theta)$ is not
available. In this case the line search to find the optimal value of
$\eta(k)$ in each iteration step cannot be performed. Moreover, the true
value of the gradient $g(k)$ is not directly available in on-line adaptive
algorithms and must be estimated in each iteration step.

For this reason we propose a modified CG approach in which the time
consuming line search procedure is replaced by low-pass filtering of the
gradient norm. Here, we can use any norm (e.g. the Euclidean norm) which
converts a vector to a scalar (see Eqs. (3)-(6)). In other words, instead
of a line search we apply a simple low-pass (averaging) technique to
estimate the learning rate $\eta(k)$.

The parameters in the update recursion are given by (see Fig. 1 (a)):

$$\theta(k+1) = \theta(k) + \eta(k)\, d(k), \tag{13}$$

$$d(k) = -\hat g(k) + \alpha(k)\, d(k-1), \tag{14}$$

$$\eta(k) = (1 - \rho_1)\, \eta(k-1) + \rho_1 \beta\, \psi(\|d(k)\|), \tag{15}$$

or

$$\eta(k) = [1 - \rho_1 \eta(k-1)]\, \eta(k-1) + \rho_1 \beta\, \eta(k-1)\, \psi(\|d(k)\|), \tag{16}$$

where $0 < \rho_1 < 1$ is a small positive constant and $\alpha(k)$ is an
adaptive coefficient computed on the basis of the Polak-Ribière or
Fletcher-Reeves formula, with resetting to zero every $R$ iterations or
if $\alpha(k)$ achieves a value greater than a specified threshold
(typically a value between 1 and 2). An estimate $\hat g(k)$ of the true
gradient $g(k)$ is computed using one of the following formulas:

1) Recurrent formula realized by a first-order IIR low-pass filter (see
Fig. 1 (b)):

$$\hat g(k) = (1 - \rho_2)\, \hat g(k-1) + \rho_2\, g(k). \tag{17}$$

2) Sliding window realized by an M-th order FIR low-pass filter, which in
the special case performs simple averaging (possibly with forgetting
factor $\gamma$) (see Fig. 1 (c)):

$$\hat g(k) = \frac{1}{M} \sum_{i=0}^{M-1} \gamma^i\, g(k-i). \tag{18}$$

The on-line learning algorithm (13)-(18) has useful circuit and signal
processing interpretations (see Fig. 1 (a), (b) and (c)). Equations
(14)-(17) can be considered as first-order low-pass discrete-time filters.
From these interpretations many possible extensions or generalizations
follow. For instance, instead of first-order low-pass filters we could
employ second or higher order IIR or FIR filters. Moreover, the
coefficients $\rho_1$, $\rho_2$ and $(1 - \alpha(k))$ can be interpreted
as parameters corresponding to the cut-off frequencies of low-pass
filters. Instead of filters with fixed cut-off frequencies we could use
adaptive filters with adjustable cut-off frequency, e.g. we could assume
that $\rho_1(k) = \rho_1 \eta(k-1)$ (see Eq. (6) [8, 12]).

Our new algorithm (13)-(18) offers the following advantages: (i) high
convergence speed due to the optimization based on the conjugate gradient
approach; (ii) low complexity due to the replacement of a costly line
search procedure by a much simpler computation of the step size
(Eq. (15)); (iii) adaptive learning rate suitable for non-stationary
environments.
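
A minimal sketch of the on-line recursion, using the IIR gradient estimate (17), the Fletcher-Reeves coefficient evaluated on the filtered gradients, and the step-size filter (15), is given below. Computing $\alpha(k)$ from the filtered rather than the raw gradients, the choice $\psi = \tanh$ of the mean squared component, the toy gradient and all parameter values are assumptions made for this illustration.

```python
import numpy as np

def online_modified_cg(grad_fn, theta0, steps, rho1=0.05, rho2=0.1,
                       beta=0.3, eta0=0.01, restart=None, alpha_max=1.0):
    """Sketch of Eqs. (13)-(15) and (17): CG-like search direction with a
    low-pass filtered step size instead of a line search."""
    theta = np.asarray(theta0, dtype=float).copy()
    R = theta.size if restart is None else restart
    g_hat = grad_fn(theta, 0)
    d = -g_hat
    eta = eta0
    for k in range(steps):
        g = grad_fn(theta, k)
        g_hat_new = (1.0 - rho2) * g_hat + rho2 * g        # Eq. (17)
        # Fletcher-Reeves ratio on the filtered gradients (assumption)
        alpha = (g_hat_new @ g_hat_new) / max(g_hat @ g_hat, 1e-12)
        if k % R == 0 or alpha > alpha_max:
            alpha = 0.0                                     # restart
        d = -g_hat_new + alpha * d                          # Eq. (14)
        psi = np.tanh(np.mean(d ** 2))
        eta = (1.0 - rho1) * eta + rho1 * beta * psi        # Eq. (15)
        theta = theta + eta * d                             # Eq. (13)
        g_hat = g_hat_new
    return theta

# Hypothetical stationary toy problem with a noisy gradient.
rng_cg = np.random.default_rng(1)
target_cg = np.array([1.0, -2.0])

def noisy_grad_cg(theta, k):
    return (theta - target_cg) + 0.05 * rng_cg.standard_normal(2)

theta_cg = online_modified_cg(noisy_grad_cg, [0.0, 0.0], 1000)
```

No cost function is evaluated anywhere in the loop; only the instantaneous gradient-like term $g(k)$ of rule (1) is needed.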

4  COMPUTER  SIMULATION  EXPERIMENTS

In  order  to  confirm  the  validity  and  the  performance  of  the  developed  on-line 
adaptive  algorithms  we  have  tested  them  on  a  number  of  specific  problems, 
in  particular  blind  equalization/deconvolution  and  blind  separation  problems 
[3,  4,  8].  Due  to  lack  of  space  we  give  here  only  an  illustrative  example  for 
blind  separation  of  instantaneous  mixture  of  sources. 

The robust unsupervised on-line learning algorithm for the standard blind
separation of $m$ sources can be formulated in vector form
[4, 8, 7, 5, 14]:

$$w_i(k+1) = w_i(k) - \eta_i(k)\, g_i(k), \tag{19}$$

where $g_i(k) = -w_i(k) + f[y(k)]\, y^T(k)\, w_i(k)$,
$w_i = [w_{i1}, w_{i2}, \ldots, w_{im}]^T$ $(i = 1, 2, \ldots, m)$,
$y(k) = W(k)\, x(k)$, $x(k) = A\, s(k)$,
$f(y) = [f(y_1), \ldots, f(y_m)]^T$ and $W = [w_1, w_2, \ldots, w_m]$.

In this formulation there are $m$ local learning rates $\eta_i$; a
simplified formulation consists in using only one global learning rate
$\eta$, so that we obtain an on-line algorithm of the form (1) as

$$\theta(k+1) = \theta(k) - \eta(k)\, g(k), \tag{20}$$

where the concatenated vectors are defined as
$\theta(k) = [w_1^T(k), w_2^T(k), \ldots, w_m^T(k)]^T$ and
$g(k) = [g_1^T(k), g_2^T(k), \ldots, g_m^T(k)]^T$. Thus our learning
schemas (3)-(6) and (13)-(15) can be directly applied to this problem. The
key aspect of the problem we consider is that the elements of the mixing
matrix $A$ are not fixed but change rapidly or drift slowly over time.

Illustrative  example:  we  consider  the  following  synthetic  source  signals 
(assumed  to  be  unknown  to  the  algorithm)  sampled  with  sampling  period 
At  =  2-  10-^s: 

$$s_1(k) = \sin(200\pi k \Delta t)\, \cos(300\pi k \Delta t),$$

$$s_2(k) = \mathrm{sign}(\cos(400\pi k \Delta t + 50 \sin(30\pi k \Delta t))).$$

Figure 1: Functional block diagram illustrating hardware implementation of
the new learning schemas: (a) Eqs. (13)-(15); (b) Eqs. (13), (14), (16);
(c) IIR first-order low-pass filter; (d) standard averaging with sliding
window, with null initial conditions. For $\alpha(k) = 0$, the learning
rules simplify to Eqs

Figure 2: (a) Comparison of performance indexes obtained with gradient
descent (Eqs. (2)-(5)) and conjugate gradient (Eqs. (13)-(18)). Both
algorithms use the adaptive learning rate of Eq. (15), with parameters
$\beta = 0.005$, $\eta(0) = 0.01$, $\rho_2 = 0.01$; restarts are performed
every $R = 4$ iterations (conjugate gradient case). Upper figure (“IIR
average gradient”): the gradient is estimated using an IIR filter of
parameter $\rho_1 = \rho_2$. Lower figure (“FIR average gradient”): the
gradient is estimated using a FIR filter of parameter $M = 100$.
(b) Weights and learning rate for Eqs. (13)-(15) and IIR filter, in case
of a double change of mixing matrix.

First experiment (see Fig. 2 (a)): sensor signals are obtained by mixing
the source signals with an exemplary mixing matrix $A_1$ during the first
second and with matrix $A_2 = A_1^T$ during the next second. Second
experiment (see Fig. 2 (b)): mixing with $A_1$ during 0.8 s, with
$A_2 = A_1^T$ during the next 0.8 s, and again with $A_1$. In order to
assess the performance of our algorithm we use the following normalized
performance index PI, which provides a measure of crosstalk, i.e. of the
closeness between $W^{-1}$ and the desired mixing matrix $A$, while taking
into consideration the indeterminacy (scaling and order of the estimated
sources) inherent to the blind separation problem [4, 8]:


r. 


^  m  I  m 


i=i  1 1=1 


\pii? 

max,  |pi,P 


biiP 

max,  \pqj\^ 


where $P(k) = [p_{ij}] = W(k) A$.

Computer simulation results show that the proposed algorithm has good
dynamic convergence and tracking capabilities. On the basis of intensive
computer simulation experiments we have found that the new algorithm with
the Fletcher-Reeves formula, with resetting of $\alpha(k)$ every $R = n$
($n = m^2$, number of parameters) iterations or if $\alpha(k)$ achieves a
value greater than 1, provides better performance than the Polak-Ribière
formula or the algorithms given by Eqs. (2)-(6). Furthermore, averaging
with a sliding time window of length $M$ between 50 and 100 provides
slightly better performance than using a first-order IIR low-pass filter.


5  CONCLUSIONS 

A  new  class  of  on-line  adaptive  learning  algorithms  with  a  variable  step  size 
is  proposed.  In  our  algorithms  the  learning  rate  is  adjusted  depending  on 
the  value  of  gradient  norm  or  strictly  speaking  low-pass  filtered  version  of 
the  norm  of  actual  search  direction.  The  low-pass  filtering  process  is  op¬ 
timized  by  employing  a  conjugate  gradient  approach.  In  contrast  to  other 
conjugate  gradient  approaches  which  require  knowledge  of  a  cost  function, 
our  algorithms  can  be  applied  either  when  a  cost  function  is  available  (su¬ 
pervised  case)  or  not  (unsupervised  case).  In  this  paper  we  have  presented 
an  application  to  the  problem  of  blind  separation  of  sources  illustrating  the 
applicability  of  the  algorithm  to  the  unsupervised  case. 

The main advantages are that the proposed algorithm adapts with high
convergence speed to a fast or slow change of the system and also produces
a small final misadjustment error. This allows the adaptive system to
track slow changes, to adapt relatively quickly to abrupt changes, and to
maintain a small steady-state misadjustment.
Abstract.  A  method  for  the  analysis  of  nonstationary  time  series
with  multiple  operating  modes  is  presented.  In  particular,  it  is  pos¬ 
sible  to  detect  and  to  model  a  switching  of  the  dynamics  and  also 
a  less  abrupt,  time  consuming  drift  from  one  mode  to  another.  This 
is  achieved  by  an  unsupervised  algorithm  that  segments  the  data 
according  to  inherent  modes,  and  a  subsequent  search  through  the 
space  of  possible  drifts.  An  application  to  physiological  wake/sleep 
data  demonstrates  that  analysis  and  modeling  of  real-world  time  se¬ 
ries  can  be  improved  when  the  drift  paradigm  is  taken  into  account. 
In  the  case  of  wake/sleep  data,  we  hope  to  gain  more  insight  into  the 
physiological  processes  that  are  involved  in  the  transition  from  wake 
to  sleep. 


1  Introduction 

Modeling dynamical systems through a measured time series is commonly
done by reconstructing the state space with time-delay coordinates
[11, 16]. The prediction of the time series can then be accomplished by
training neural networks [17]. If, however, a system operates in multiple
modes and the dynamics is drifting or switching, standard approaches like
multi-layer perceptrons are likely to fail to represent the underlying
input-output relations. Moreover, they do not reveal the dynamical
structure of the system. Time series from alternating dynamics can
originate from many kinds of systems in physics, biology and engineering.
Phenomena of this kind are observed, for example, in speech [15], brain
data [10, 12], or dynamical systems which switch attractors [3].

In  [5,  9, 13],  we  have  described  a  framework  for  time  series  from  switching 
dynamics,  in  which  an  ensemble  of  neural  network  predictors  specializes  on 
the  respective  operating  modes.  Related  approaches  can  be  found  in  [1,  18]. 
We  now  extend  the  ability  to  describe  a  mode  change  not  only  as  a  switching 
but  -  if  appropriate  -  also  as  a  drift  from  one  predictor  to  another.  Our 
results  indicate  that  physiological  signals  contain  drifting  dynamics,  which 
underlines  the  potential  relevance  of  our  method  in  time  series  analysis. 
2  Detecting  Drifts  in  the  Dynamics


The detection and analysis of drifts is performed in two steps. First, an
unsupervised (hard) segmentation method is applied. It was first presented
in [4]. In this approach, an ensemble of prediction experts $f_i$,
$i = 1, \ldots, N$, is trained by maximizing the likelihood that the
ensemble would have generated the time series. For the derivative of the
log-likelihood with respect to the output of an expert, we get (cf. [5])

$$\frac{\partial \log L}{\partial f_i} \propto \frac{e^{-\beta (y - f_i)^2}}{\sum_j e^{-\beta (y - f_j)^2}}\, (y - f_i), \tag{1}$$



where $y$ is a data point to be predicted. This learning rule can be
interpreted as a weighting of the learning rate of each expert by the
expert’s relative prediction performance. It is a special case of the
Mixtures of Experts [2] learning rule, with the gating network being
omitted. Furthermore, we imposed a low-pass filter on the prediction
errors and used deterministic annealing in the training process (see
[5, 13] for details).¹
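
The weighting by relative prediction performance can be sketched as follows. The sketch assumes that the relative performance of expert $i$ is $e^{-\beta (y - f_i)^2}$ normalized over all experts; the low-pass filtering of the errors and the annealing of $\beta$ are omitted, and the function names are illustrative.

```python
import numpy as np

def expert_weights(y, predictions, beta):
    """Relative performance of each expert for one data point y:
    exp(-beta * err_i^2) normalized over the ensemble."""
    sq_err = (y - np.asarray(predictions, dtype=float)) ** 2
    logits = -beta * sq_err
    logits -= logits.max()          # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def output_gradients(y, predictions, beta):
    """Per-expert error signal: the residual (y - f_i) scaled by the
    expert's relative performance (constant factors dropped)."""
    preds = np.asarray(predictions, dtype=float)
    return expert_weights(y, preds, beta) * (y - preds)

w_soft = expert_weights(1.0, [0.9, 0.0, 2.0], beta=1.0)
w_hard = expert_weights(1.0, [0.9, 0.0, 2.0], beta=50.0)
grads = output_gradients(1.0, [0.9, 0.0, 2.0], beta=50.0)
```

For small $\beta$ all experts receive comparable error signals, whereas for large $\beta$ almost all of the weight goes to the best expert, so that only this expert is trained on the data point.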

As a prerequisite of this method, mode changes should occur infrequently,
i.e. between two mode changes the dynamics should remain stationary in
one mode for a certain number of time steps. Applying this method to a
time series yields a (hard) segmentation of the series into different
operating modes together with prediction experts for each mode. In case
of a drift between two modes, the respective segment tends to be
subdivided into several parts, because a single predictor is not able to
handle the nonstationarity.

The second step takes the drift into account. A segmentation algorithm is
applied that allows drifts between two stationary modes to be modeled by
combining the two respective predictors, $f_i$ and $f_j$. The drift is
modeled by a weighted superposition

$$f(\mathbf{x}_t) = a(t)\, f_i(\mathbf{x}_t) + (1 - a(t))\, f_j(\mathbf{x}_t), \qquad 0 < a(t) < 1, \tag{2}$$

where $a(t)$ is a mixing coefficient and
$\mathbf{x}_t = (x_t, x_{t-\tau}, \ldots, x_{t-(m-1)\tau})^T$ is the
vector of time-delay coordinates of a (scalar) time series $\{x_t\}$.
Furthermore, $m$ is the embedding dimension and $\tau$ is the delay
parameter of the embedding. Note that the extension to multivariate time
series is straightforward.
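
The time-delay vectors and the weighted superposition of two predictors can be sketched as below. The two stand-in “predictors” are arbitrary illustrative functions, not trained networks, and the function names are assumptions.

```python
import numpy as np

def delay_embed(x, m, tau):
    """Rows are the delay vectors (x_t, x_{t-tau}, ..., x_{t-(m-1)tau})
    for every t at which all coordinates exist."""
    x = np.asarray(x, dtype=float)
    start = (m - 1) * tau
    cols = [x[start - j * tau: len(x) - j * tau] for j in range(m)]
    return np.stack(cols, axis=1)

def drift_prediction(f_i, f_j, a, X):
    """Weighted superposition a * f_i + (1 - a) * f_j of two predictors."""
    return a * f_i(X) + (1.0 - a) * f_j(X)

X = delay_embed(np.arange(10), m=3, tau=2)

# Stand-in predictors (illustrative only): each returns one coordinate.
first_coord = lambda Z: Z[:, 0]
second_coord = lambda Z: Z[:, 1]
mixed = drift_prediction(first_coord, second_coord, 0.25, X)
```

With `m=3` and `tau=2` the first usable time step is $t = 4$, so the ten samples yield six delay vectors.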

The drift segmentation algorithm performs a complete search for the
optimal segmentation with the lowest average prediction error and takes
the described drift into account. The search is performed in the
following way: for a given time series, which is not necessarily the
training set used in the training phase, each of the previously trained
experts performs a prediction for every time step, which results in a
matrix of expert outputs $f_i(\mathbf{x}_t)$ versus time steps $t$. This
matrix can then be used to compute the mean prediction errors for
arbitrary segmentations of the time series, including drifts of any
length and shape. The best segmentation with the lowest prediction error
can be obtained efficiently by dynamic programming.

¹ Further information and papers can be found at:
http://www.first.gmd.de/persons/Kohlmorgen.Jens.html
2.1  The  Drift  Segmentation  Algorithm  in  Detail

Consider a set $P$ of ‘pure’ states (dynamical modes). Each state
$s \in P$ represents a single neural network $k(s)$, which solely performs
a prediction. Next, consider a set $M$ of ‘mixed’ states, where each state
$s \in M$ represents a linear mixture of two nets $i(s)$ and $j(s)$. Then,
given a state $s \in S$, $S = P \cup M$, the prediction of the overall
system is performed by

$$g_s(\mathbf{x}_t) = \begin{cases} f_{k(s)}(\mathbf{x}_t) & \text{if } s \in P, \\ a(s)\, f_{i(s)}(\mathbf{x}_t) + b(s)\, f_{j(s)}(\mathbf{x}_t) & \text{if } s \in M. \end{cases} \tag{3}$$


For each mixed state $s \in M$, the coefficients $a(s)$ and $b(s)$ have
to be set together with the respective network indices $i(s)$ and $j(s)$.
For computational feasibility, the number of mixed states has to be
restricted. Our intention is to allow for drifts between any two network
outputs of the previously trained ensemble. We choose $a(s)$ and $b(s)$
such that $0 < a(s) < 1$ and $b(s) = 1 - a(s)$. Moreover, the algorithm
only allows for a discrete set of mixed states. Consequently, a discrete
set of $a(s)$ values has to be defined. For simplicity, equally distant
steps can be chosen,

$$a_r = \frac{r}{R+1}, \qquad r = 1, \ldots, R. \tag{4}$$

$R$ is the number of intermediate mixture levels. A given resolution $R$
between any two out of $N$ nets yields a total number of mixed states
$|M| = R \cdot N \cdot (N-1)/2$. For example, in this paper the
resolution $R = 32$ is used. Assume $N = 8$; then there are $|M| = 896$
mixed states, plus $|P| = N = 8$ pure states where only single nets are
considered.
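
The state count can be checked with a short enumeration. The tuple layout of the states and the function name are illustrative assumptions; the mixing levels are taken to be equally spaced, $a_r = r/(R+1)$.

```python
from itertools import combinations

def build_states(n_nets, resolution):
    """Enumerate pure states (k,) and mixed states (i, j, a) with
    equally spaced mixing coefficients a_r = r / (resolution + 1)."""
    pure = [(k,) for k in range(n_nets)]
    mixed = [(i, j, r / (resolution + 1))
             for i, j in combinations(range(n_nets), 2)
             for r in range(1, resolution + 1)]
    return pure, mixed

pure_states, mixed_states = build_states(n_nets=8, resolution=32)
n_total = len(pure_states) + len(mixed_states)
```

For $N = 8$ and $R = 32$ this gives 28 unordered pairs of nets with 32 levels each, i.e. 896 mixed states and 904 states in total.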

A dynamic programming technique, equivalent to the Viterbi algorithm for
Hidden Markov Models (HMM) [15], efficiently yields the sequence of nets
and linear mixtures of nets with the lowest prediction cost $C^*$. $C^*$
is the sum of squared prediction errors plus transition costs for the
best fitting sequence. This sequence can be obtained between two points
in time, $t_0$ and $t_{max}$, by recursively computing, for all
$s \in S$, the cost $C_s(t)$ of the most likely state sequence that might
have produced the time series and whose state at time $t$ is $s$:

$$C_s(t_0) = \varepsilon_s(t_0), \tag{5}$$

$$C_s(t) = \varepsilon_s(t) + \min_{\hat s \in S} \{ C_{\hat s}(t-1) + T(\hat s, s) \}, \qquad t = t_0 + 1, \ldots, t_{max}, \tag{6}$$

$$C^* = \min_{s \in S} C_s(t_{max}), \tag{7}$$

where

$$\varepsilon_s(t) = (x_t - g_s(\mathbf{x}_{t-\tau}))^2 \tag{8}$$

is the squared prediction error of the pure or mixed network output, and
$T(\hat s, s)$ is the transition cost to jump from state $\hat s$ to
state $s$. Note that the transition costs are in analogy to the
transition probabilities in HMMs, i.e. the choice of the transition
matrix $T$ determines the transition probability between any two states.
In this way, a priori knowledge about the problem can be incorporated. In
the following applications, either switches or smooth drifts between two
nets are allowed; all other possible transitions are disabled by setting
$T(\hat s, s) = \infty$.

The  resulting  segmentation  sequence  is  obtained  by  backtracking  through 
the  sequence  of  states  that  make  up  C*  (cf.  [15]).  In  the  context  of  HMMs, 
this  segmentation  can  be  interpreted  as  the  most  likely  state  sequence  that 
could  have  generated  the  given  time  series,  in  our  case  with  the  additional 
assumption  that  mode  changes  occur  either  as  (smooth)  drifts  or  as  infrequent 
switches. 
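
The recursion and the backtracking step can be sketched as follows. The error matrix and the transition costs in the usage example are small illustrative assumptions with two states; in the application they would hold the squared errors of all pure and mixed states.

```python
import numpy as np

def best_state_sequence(errors, trans_cost):
    """Dynamic programming over states.
    errors[t, s]     : squared prediction error of state s at step t
    trans_cost[p, s] : cost of jumping from state p to state s
                       (np.inf disables a transition)."""
    n_steps, n_states = errors.shape
    cost = errors[0].astype(float)                 # initialization
    back = np.zeros((n_steps, n_states), dtype=int)
    for t in range(1, n_steps):
        total = cost[:, None] + trans_cost         # total[p, s]
        back[t] = np.argmin(total, axis=0)         # best predecessor of s
        cost = errors[t] + total.min(axis=0)       # recursion step
    s = int(np.argmin(cost))                       # best final state
    best_cost = float(cost[s])
    path = [s]
    for t in range(n_steps - 1, 0, -1):            # backtracking
        s = int(back[t, s])
        path.append(s)
    return best_cost, path[::-1]

# Two states; state 0 fits the first half, state 1 the second half.
demo_errors = np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])
demo_trans = np.array([[0.0, 0.5], [0.5, 0.0]])
demo_cost, demo_path = best_state_sequence(demo_errors, demo_trans)
```

In this example a single switch after the second step is cheaper than staying in either state, so the recovered sequence changes state exactly once. The cost of each step is proportional to the square of the number of states, which is why the number of mixed states has to be restricted.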

3  Applications 

To  illustrate  the  basic  idea  of  this  approach,  a  simple  example  of  drifting 
chaotic  dynamics  is  discussed  first.  It  is  followed  by  an  application  to  a  drift¬ 
ing  system  of  the  Mackey-Glass  model  of  blood  cell  regulation.  Finally,  an 
application  to  real-world  data  is  presented:  EEG  data  of  an  afternoon  nap  of 
a  human. 


3.1  Drifting  Chaos 

Consider a chaotic time series $\{x_t\}$, where $x_{t+1} = f(x_t)$,
Fig. 1(a). Four major operating modes are established by using four
different chaotic maps:

$f_1(x) = 4x(1 - x)$, $x \in [0, 1]$ (logistic map)

$f_2(x) = f_1(f_1(x))$ (double logistic map)

$f_3(x) = 2x$ if $x \in [0, 0.5)$, and $2(1 - x)$ if $x \in [0.5, 1]$ (tent map)

$f_4(x) = f_3(f_3(x))$ (double tent map)

For the first 50 time steps, $f_1$ is applied recursively, starting with
$x_0 = 0.5289$. After $t = 50$ time steps, the dynamics is drifting from
$f_1$ to $f_2$ using

$$f(x_t) = (1 - a(t))\, f_1(x_t) + a(t)\, f_2(x_t), \qquad a(t) = \frac{t - t_a}{t_b - t_a}, \tag{9}$$

with $t_a = 50$ and $t_b = 100$. The drift is linear in time and takes
another 50 time steps. Then, the system runs stationary in mode $f_2$ for
the following 50 time steps, whereupon it is drifting to $f_3$ in the
same fashion as before, and so on. At $t = 350$, the system starts to
drift back from $f_4$ to $f_1$, and the cycle starts again at $t = 400$.
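
A minimal sketch of a generator for such a series is given below. It assumes the schedule described above (50 stationary steps followed by a 50-step linear drift to the next map, cyclically); the function names are illustrative.

```python
import numpy as np

def logistic(x):
    return 4.0 * x * (1.0 - x)

def tent(x):
    return 2.0 * x if x < 0.5 else 2.0 * (1.0 - x)

# f1 .. f4: logistic, double logistic, tent, double tent
MAPS = [logistic, lambda x: logistic(logistic(x)),
        tent, lambda x: tent(tent(x))]

def drifting_chaos(n_steps, x0=0.5289):
    """Iterate x_{t+1} = (1 - a) f_cur(x_t) + a f_next(x_t), where a is 0
    for 50 steps and then rises linearly over the next 50 steps."""
    x = np.empty(n_steps + 1)
    x[0] = x0
    for t in range(n_steps):
        phase, offset = divmod(t % 400, 100)   # 4 phases of 100 steps
        cur, nxt = MAPS[phase], MAPS[(phase + 1) % 4]
        a = 0.0 if offset < 50 else (offset - 50) / 50.0
        x[t + 1] = (1.0 - a) * cur(x[t]) + a * nxt(x[t])
    return x

series = drifting_chaos(1200)
```

One practical caveat of such a sketch: in binary floating-point arithmetic the pure tent map discards one mantissa bit per iteration, so long stationary tent-map stretches can collapse onto the fixed point 0. With 50-step stationary segments this is borderline, and a small perturbation of the state may be needed in practice.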

In the resulting time series one cannot determine the appropriate
continuation $x_{t+1}$, given only $x_t$ and no information about the
operating mode. In principle, one way to solve this problem is the method
of time-delay embedding [11]. In this case, however, the inclusion of
such kind of memory, e.g. using also $x_{t-1}$, leads to a very complex
prediction function [14]. Moreover, using a single network for prediction
does not reveal the dynamical structure of the system. An adequate
representation of the underlying relations should therefore contain a
division into subtasks, as it is performed within our framework.

Fig. 1. (a) A part of the training data, generated by the chaotic return
maps $f_1$ and $f_4$. First, $f_4$ is iterated from $t = 300$ to
$t = 350$. Then, there is a drift to $f_1$ between $t = 350$ and
$t = 400$. After $t = 400$, $f_1$ is iterated. (b) The final segmentation
into training subsets, obtained by the competitive training procedure.
Shown are the first 450 data points. This segmentation cannot represent
the drift. The stationary parts, $f_1$ in [0,50] and [400,450], $f_2$ in
[100,150], $f_3$ in [200,250], $f_4$ in [300,350], are predicted by nets
6, 2, 4, and 3, respectively. The nonstationary drift parts in between
are shared among all predictors, including nets 1 and 5.

First, the competing experts approach [5, 9, 13] is applied to the first
1200 data points of the generated time series. An ensemble of 6
predictors $f_i(x_t)$, $i = 1, \ldots, 6$, competes for the data during
the training phase. We use radial basis function (RBF) networks of the
Moody-Darken type [8] as predictors, because they offer a fast and robust
learning method. Each predictor has 20 basis functions. After training,
four predictors have each specialized on a different chaotic map, and the
other two predictors tried to specialize on the drift parts. This can be
observed in the final segmentation of the competition procedure, shown in
Fig. 1(b).

Next, the drift segmentation algorithm is applied to all six networks. It
almost perfectly reproduces the behavior of the dynamics, as seen in Fig. 2(a)
for the resolution $R = 32$: a linear drift between four stationary operating
modes is obtained. Fig. 2(b) is included to demonstrate the effect of a lower
resolution.

Fig. 2. (a) The segmentation obtained by the drift algorithm on the test data
[1200,2400], using the resolution $R = 32$. Shown is the sequence of nets as a
function of time. The dotted line indicates the evolution of the mixing
coefficient $a(t)$ of the respective nets. For example, between $t = 1350$ and
$t = 1400$ it denotes a drift from net 2 to net 4, which in this case turns out
to be a linear drift, as expected. The segmentation almost perfectly reproduces
the behavior of the dynamical system. (b) A segmentation with a low resolution,
$R = 3$, traverses the drift parts in 3 steps.


3.2  A  Drifting  Mackey-Glass  System 

Consider a high-dimensional chaotic system generated by the Mackey-Glass
delay differential equation

$$\frac{dx(t)}{dt} = -0.1\,x(t) + \frac{0.2\,x(t - t_d)}{1 + x(t - t_d)^{10}}. \qquad (10)$$


It was originally introduced as a model of blood cell regulation [7]. Two
stationary operating modes, A and B, are established by using different
delays, $t_d = 17$ and $t_d = 23$, respectively. After operating 100 time steps
in mode A (with respect to a subsampling step size $\tau = 6$), the dynamics
drifts to mode B. The drift takes another 100 time steps. It is performed by
mixing the equations for $t_d = 17$ and $t_d = 23$ during the integration of
eq. (10). The mixture is generated according to eq. (2), using an exponential
drift


$$a(t) = \exp(\ldots), \qquad t = 1, \ldots, 100. \qquad (11)$$




Fig. 3. (a) The drifting Mackey-Glass time series. The dynamical system
operates in mode A for the first 100 time steps. Then, the dynamics drifts to
mode B during the next 100 steps, and remains stationary in B. After $t = 300$,
the system switches back to mode A and the cycle starts again. (b) The
resulting drift segmentation invokes four nets. This is because two nets became
experts for mode A, and two others for mode B. (c) Increase of the prediction
error when predictors are successively removed. Although no further training
has been performed, up to four predictors can be removed without a significant
increase of the prediction error. (d) The two remaining predictors model the
dynamics of the time series properly.


Then, the system runs stationary in mode B for the following 100 time steps,
whereupon it switches back to mode A at $t = 300$, and the loop starts again
(Fig. 3(a)). The competing experts algorithm is applied to the first 1500 data
points of the generated time series, using an ensemble of 6 predictors
$f_i(x_t)$, $i = 1, \ldots, 6$. The input to each predictor is a vector of
time-delay coordinates of the scalar time series $\{x_t\}$. The embedding
dimension is $m = 6$ and the delay parameter is $\tau = 1$ on the subsampled
data. The RBF predictors consist of 40 basis functions each.
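The construction of the predictor inputs from time-delay coordinates can be sketched as follows. This is an illustrative helper only: the function name `delay_embed` is hypothetical, and the choice of the next sample $x_{t+1}$ as prediction target is an assumption made for the example.

```python
def delay_embed(series, m=6, tau=1):
    """Build time-delay vectors (x_{t-(m-1)tau}, ..., x_{t-tau}, x_t).

    Returns (inputs, targets), where targets[j] is the sample that follows
    the last coordinate of inputs[j] (one-step-ahead target, an assumption).
    """
    inputs, targets = [], []
    first = (m - 1) * tau                      # first index with a full history
    for t in range(first, len(series) - 1):
        inputs.append([series[t - k * tau] for k in range(m - 1, -1, -1)])
        targets.append(series[t + 1])
    return inputs, targets

# Toy usage with embedding dimension 6 and delay 1, as in the text.
toy = [0.1 * i for i in range(20)]
X, y = delay_embed(toy, m=6, tau=1)
```

Each row of `X` can then be fed to a predictor, with the corresponding entry of `y` as the desired output.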

After training, nets 2 and 3 have specialized on mode A, nets 5 and 6 on
mode B. This can be seen in the drift segmentation in Fig. 3(b). Moreover, the
removal of four nets does not increase the root mean squared error (RMSE)
of the prediction significantly (Fig. 3(c)), which correctly indicates that two
predictors completely describe the dynamical system. The sequence of nets to
be removed is obtained by repeatedly computing the RMSE of all $n$ subsets
with $n - 1$ nets each, and selecting the subset with the lowest RMSE of the
respective drift segmentation. The segmentation of the remaining nets, 2 and
5, nicely reproduces the evolution of the dynamics, as seen in Fig. 3(d).
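The removal sequence described above amounts to a greedy backward elimination. The sketch below illustrates the idea under stated assumptions: `rmse_of` stands for any routine that returns the RMSE of the drift segmentation restricted to a given subset of nets (here replaced by a toy function), and all names are hypothetical.

```python
def pruning_order(nets, rmse_of):
    """Greedy backward elimination.

    In each round, evaluate all subsets obtained by dropping one net and keep
    the subset with the lowest RMSE. Returns a list of (removed_net, rmse).
    """
    remaining = list(nets)
    history = []
    while len(remaining) > 1:
        candidates = [[n for n in remaining if n != r] for r in remaining]
        best = min(candidates, key=rmse_of)            # lowest-error subset
        removed = next(n for n in remaining if n not in best)
        history.append((removed, rmse_of(best)))
        remaining = best
    return history

def toy_rmse(subset):
    # Toy stand-in: nets 2 and 5 are assumed to be the only essential ones.
    missing = sum(1 for essential in (2, 5) if essential not in subset)
    return 0.1 + 0.4 * missing

history = pruning_order([1, 2, 3, 4, 5, 6], toy_rmse)
```

In this toy setting the error stays flat while the four redundant nets are removed and jumps only when one of the two essential nets is dropped, which is the qualitative behavior reported in Fig. 3(c).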


3.3  Wake/Sleep  Data 

In [10], we analyzed physiological data recorded from the wake/sleep
transition of a human. The objective was to provide an unsupervised method
to detect the sleep onset and to give a detailed approximation of the signal
dynamics with a high time resolution, ultimately to be used in diagnosis
and treatment of sleep disorders. The application of the drift segmentation
algorithm now yields a more detailed modeling of the dynamical system.

As an example, Fig. 4 shows a comparison of drift segmentation ($R = 32$),
hard segmentation ($R = 0$, i.e. no drift), and a manual segmentation by a
medical expert. The experimental data was measured during an afternoon
nap of a healthy human. The computer-based analysis is performed on a
single-channel EEG recording (occipital-1), whereas the manual segmentation
was worked out using six physiological signals (EEG, EOG, ECG, heart rate,
blood pressure, respiration).

The drift algorithm yields several drift parts. The sleep onset, according
to the manual segmentation at $t \approx 4000$, is represented by an
exponential drift from a wake-state predictor, net 7, to a sleep-state
predictor, net 4. On the other hand, the wake-up is introduced at
$t \approx 9000$ by a slight drift back to net 7, which holds until the wake-up
point is reached ($t \approx 9500$ in the manual segmentation). There, a sudden
change of the mixing coefficient gives more weight to wake-state net 7. After
$t \approx 9800$ (eyes open), a mixture of two wake-state nets, 2 and 7,
performs the prediction.

Compared  to  both  hard  segmentations,  the  drift  segmentation  reveals 
several  interesting  details  of  the  dynamical  changes  in  the  transition  between 
different  wake/sleep  stages.  A  more  comprehensive  analysis  of  the  wake/sleep 
data  is  beyond  the  scope  of  this  contribution  and  we  would  like  to  refer  the 
reader  to  our  forthcoming  publication  [6]. 

4  Summary  and  Discussion 


A method for the unsupervised segmentation and identification of nonstationary
drifting dynamics was presented. It applies to time series where the
dynamics drifts or switches between different operating modes. The method
was illustrated in two cases of drifting chaotic systems. An application to
physiological wake/sleep data demonstrates that drift can be found in natural
systems. It is therefore important to consider this aspect of data description.

Fig. 4. Comparison of hard segmentation (upper), drift segmentation (middle),
and a manual segmentation by a medical expert (lower). Only a single-channel
EEG recording (occipital-1, 1400 sec.) of an afternoon nap is given for the two
algorithmic approaches. W1 and W2 indicate two wake-states (eyes open/closed)
in the manual analysis, S1 and S2 indicate sleep stage I and II, respectively
(n.a.: not considered, art.: artifacts). Compared to the manual segmentation,
the EEG is properly segmented by our method.


In the case of wake/sleep data, where the physiological state transitions
are far from being understood, we are able to extract the shape of the
dynamical drift from wake to sleep in an unsupervised manner. By applying this
new analysis tool, we hope to gain more insights into the underlying
physiological processes. Our future work is therefore dedicated to a
comprehensive analysis of large physiological datasets. We expect, however,
that our method will also be applicable in many other fields.

ABSTRACT

Clustering is an important research area with practical applications in many
fields. Fuzzy clustering has shown advantages over crisp and probabilistic
clustering, especially when there are significant overlaps between clusters
[1], [2], [7], [9], [10]. However, all fuzzy clustering algorithms are
sensitive to an exponent parameter, namely the fuzzifier. To our knowledge, no
theoretical foundations are yet available for the optimal choice of this
parameter [2], [3], [4], [5], [6], [9], [10]. The current work develops an
improved scheme for the fuzzifier by embedding more knowledge about the data
set to be clustered in its computation.


Keywords:  Neural  Networks,  Clustering,  Fuzzy  Set  Memberships, 
Fuzzifier. 


1  INTRODUCTION 

Clustering is an important research area with practical applications in a
variety of fields, including pattern recognition and data compression [11].
Clustering algorithms attempt to partition the input data into groups, i.e.
clusters, such that patterns within a cluster are more similar to each other
than are patterns in distinct clusters [5]. Fuzzy clustering algorithms have
shown advantages over their crisp/probabilistic counterparts, especially when
there are significant overlaps between clusters. Investigation of the
motivations for introducing the fuzzy set membership notion within clustering
algorithms goes beyond the scope of the current paper. For a concise review of
fuzzy clustering schemes and an experiment-based comparison with their
crisp/probabilistic counterparts we refer the reader to the literature; e.g.
[3], [4], [5], [9], [10], [12].

By adopting the fuzzy set membership notion, many fuzzy clustering algorithms
were proposed; e.g. the Fuzzy Learning Vector Quantization (FLVQ, for short)
[2], the Fuzzy C-Means (FCM, for short) [5], and the Fuzzy C Spherical Shells
(FCSS, for short) [10]. A fuzzy clustering algorithm makes use of an exponent
parameter called the fuzzifier. Its setting is the most problematic choice in
fuzzy clustering. Indeed, this parameter affects the convergence rate as well
as the cluster validity of the algorithm. To our knowledge, no theoretical
foundations for an optimal choice of this parameter are yet available [2], [3],
[4], [5], [6], [9], [10]. Instead, an adequate choice is always made via
experimentation [3], [4], [5], [6], [9], [10], i.e. this choice is still
largely heuristic.

A general rule of thumb for setting the fuzzifier was proposed and analyzed in
the literature [4], [5], [2]. In the present paper, we propose a solution that
makes use of this general rule of thumb. In addition, we create closer
interactions between the fuzzifier and the data set to be clustered.

The remainder of this paper is organized as follows. In Section 2 we formulate
the clustering problem and give the main notations adopted in this paper. In
Section 3 we discuss the problems caused by an inadequate choice of the
fuzzifier and outline our proposal and related work. In Section 4 preliminary
experimental results are reported. The final section offers discussions and
conclusions.


2  NOTATION  AND  PROBLEM  FORMULATION

Let $c$ be an integer, $1 < c < n$, and let $X = \{x_1, x_2, \ldots, x_n\}$ be
a set of $n$ feature vectors in $\mathbb{R}^p$. $X$ is a numerical object data
set; $x_j$ is a representation of the $j$th object in $X$, and $x_{jk}$ is the
$k$th feature value of the $j$th object. The number of clusters can be
specified a priori or computed from the input data set using criteria of
optimality such as the fuzzy hypervolume and density [6]. A cluster is defined
by computing its centroid, i.e. its class prototype. For this, let
$V = \{\nu_1, \nu_2, \ldots, \nu_c\}$ be the set of centroids,
$\nu_i \in \mathbb{R}^p$ ($1 \le i \le c$).

Fuzzy clustering algorithms find the optimal partition of the input data
into clusters by minimizing an error criterion, namely the weighted within
groups sum of squared errors objective function [2], [3], [4], [5], [6], [9]:

$$J_m(U, V) = \sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ik})^m \, \| x_k - \nu_i \|_A^2 \qquad (1)$$

subject to the constraints:

$$0 < \sum_{k=1}^{n} u_{ik} < n \quad \forall i, \qquad \sum_{i=1}^{c} u_{ik} = 1 \quad \forall k, \qquad (2)$$

where $u_{ik}$ is the membership degree of the $k$th feature vector in the
$i$th cluster, $U$ is a $c \times n$ matrix of $u_{ik}$ values
($1 \le i \le c$, $1 \le k \le n$), and $\| \cdot \|_A$ is a distance measure
w.r.t. the matrix $A$. $U$ is called the fuzzy c-partition of the initial data
set, and $m$ is the fuzzifier used by any fuzzy clustering algorithm.

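The following theorem [2] states conditions on the fuzzy c-partition in
order to minimize the objective function:

Theorem 1. Assume $\| x_k - \nu_j \|_A > 0$ $\forall j, k$. $(U, V)$ may
minimize $J_m$ only if, for $m > 1$:

$$u_{ik} = \left( \sum_{j=1}^{c} \left( \frac{\| x_k - \nu_i \|_A}{\| x_k - \nu_j \|_A} \right)^{\frac{2}{m-1}} \right)^{-1} \quad \forall i, k \qquad (3)$$

$$\nu_i = \frac{\sum_{k=1}^{n} (u_{ik})^m \, x_k}{\sum_{k=1}^{n} (u_{ik})^m} \quad \forall i \qquad (4)$$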

The following section discusses the problems caused by an inadequate choice
of the fuzzifier and introduces a general rule of thumb for its setting.

3  AN  IMPROVED  SCHEME  FOR  THE  FUZZIFIER

3.1  Basics 

The fuzzifier is used to weight the distances between the prototype vectors
and the input vectors in the computation of the membership values. The
fuzzifier controls the fuzziness of the c-partition to be computed [4], [5].
This “amount” of fuzziness ranges from absolute hard clustering (at $m = 1$) to
increasingly fuzzy clustering as the fuzzifier takes larger values. This means
that the discrepancy between memberships of a pattern in different clusters
is emphasized (resp. reduced) by adopting low values (resp. large values)
for the fuzzifier. Furthermore, this parameter considerably influences the
convergence rate of the considered algorithms and the cluster validity of the
data set at hand.

To our knowledge, no theoretical foundations for the optimal choice of
this parameter are yet available [3], [5], [4], [6], [2], [9], [10]. In the
absence of a theoretical basis for the optimal fuzzifier that produces the
best clustering, the appropriate value of this parameter is set via
experimentation [3], [5], [4], [6], [9], [10]. In essence, the choice is still
largely heuristic.

However, one can point out a general rule of thumb that guides the changing of
this parameter and that addresses both the convergence rate and the cluster
validity. It can be paraphrased as follows. Clustering is considered as
a process of competition between prototypes. A simple analysis of (3) and
(4) reveals the following. By adopting large values for the fuzzifier,
each prototype will be updated at almost the same small rate, since:

$$\lim_{m \to \infty} u_{ik}(t) = \frac{1}{c} \qquad (5)$$


Therefore, distant prototypes will win as much as closer ones. Consequently,
no prototype is left out during the competition process. However, large
values are not beneficial to the convergence rate. Consequently, by decreasing
the fuzzifier gradually to low values, the convergence rate will be improved.

In conclusion, a decreasing scheme is more appropriate. Earlier works on this
subject are reviewed next.
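The two regimes of the fuzzifier can be checked numerically for a single pattern. The sketch below is only illustrative: it evaluates the membership formula of Theorem 1 for three hypothetical distances to three prototypes and for several values of $m$; the function name and the distance values are assumptions.

```python
def memberships(dists, m):
    """Memberships of one pattern, given its distances to all prototypes."""
    expo = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dj) ** expo for dj in dists) for di in dists]

dists = [1.0, 2.0, 4.0]          # hypothetical distances to c = 3 prototypes
for m in (1.1, 2.0, 10.0, 1000.0):
    print(m, [round(u, 4) for u in memberships(dists, m)])
```

For $m$ close to 1 the nearest prototype receives almost the whole membership (nearly hard clustering), whereas for very large $m$ all memberships approach $1/c$, so that distant prototypes are updated almost as much as close ones.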


3.2  Related  Work 

In the works reported in [5] and [2], the fuzzifier was allowed to decrease
linearly through the iterations from a high value to a low terminating value,
meeting the requirements discussed in §3.1. In [2], this solution was used in
the Descending FLVQ (↓FLVQ, for short). ↓FLVQ has a superior classification
rate compared to FCM [4]. Although the decreasing schemes adopted for the
fuzzifier in [2] and [5] improve the cluster validity and the convergence rate,
they are still very sensitive to the initial and final values of the fuzzifier.

It is reported in the literature that an appropriate specification of the
fuzzifier requires knowledge of the characteristics of the data set at hand
[4]. Hence, creating closer interactions between the fuzzifier and the input
data should yield better results. It is this property that we investigate and
use in our solution discussed in the next section.


3.3  Our  Proposal 

↓FLVQ [2] adopts an interesting approach to the problem of choosing the
fuzzifier. However, in ↓FLVQ the fuzzifier takes the same value for all input
patterns in each iteration, which may be non-optimal. Moreover, ↓FLVQ
is very sensitive to the final and initial values of the fuzzifier [2]. Note
that the boundaries between clusters might be fuzzy. In such a case clusters
are said to overlap [6]. The degree of overlapping is not the
same for all clusters. Consequently, they should not be treated equally as in
↓FLVQ [2]. In fact, the less the overlapping between clusters is, the lower the
values of the fuzzifier should be. This allows us to better control
the “fuzziness” (“hardness”) of each cluster by taking into consideration the
data set at hand.

To make the fuzzifier depend on the data set to be clustered, we have
investigated the possibility of using the computed memberships of an input
vector in all possible classes. In fact, the further apart these memberships
are, the more a vector is attracted towards a particular class. Formally, we
propose the following scheme:

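$$m_k(t) = m_0 \, \gamma_c \, (1 - \beta_k(t)) + \delta \quad \forall k = 1, \ldots, n \qquad (6)$$

$$\gamma_c = \frac{c}{c - 1} \qquad (7)$$

$$\beta_k(t) = \sum_{i=1}^{c} (u_{ik}(t))^2 \quad \forall k = 1, \ldots, n \qquad (8)$$

where $m_0$ is the initial fuzzifier, $\delta$ is a small real number
($\delta > 1.0$) and $t$ is the iteration counter.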

In order to meet the requirements outlined in §3.1, the parameter $\beta_k(t)$
should increase with time. Indeed, the computed memberships of a given
pattern move further and further apart, and therefore show a large
discrepancy, since the considered pattern is attracted through the iterations
to a particular cluster (resp. to more than one cluster when overlapping
is significant, i.e. memberships are close). Due to the constraint in (2),
some memberships must increase as the others decrease. Thus when $\beta_k(t)$
increases, the fuzzifier decreases accordingly, showing a behavior analogous to
simulated annealing. Due to the constraints in (2), a simple analysis of (8)
reveals that the parameter $\beta_k(t)$ ranges over $[\frac{1}{c}, 1]$. In
fact, the lowest value for $\beta_k(t)$ is obtained when all the memberships
are equal to $\frac{1}{c}$. The largest value for $\beta_k(t)$ is obtained when
the membership degree of pattern $k$ is 1 in one cluster and 0 in the others.
Hence, from (6) the fuzzifier ranges over $[\delta, m_0 + \delta]$. This
justifies the requirement $\delta > 1$, due to Theorem 1, but this parameter
should remain small.
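The range argument above can be verified with a small computation. The sketch below rests on explicit assumptions: $\beta_k$ is taken as the sum of squared memberships of pattern $k$, and the fuzzifier as $m_k = m_0\,\gamma_c\,(1 - \beta_k) + \delta$ with the normalization $\gamma_c = c/(c-1)$, which is the choice that makes the fuzzifier range exactly over $[\delta, m_0 + \delta]$. The function names are hypothetical.

```python
def beta(memberships):
    # Sum of squared memberships of one pattern over all clusters.
    return sum(u * u for u in memberships)

def data_dependent_fuzzifier(memberships, m0, delta):
    """Per-pattern fuzzifier; gamma_c = c/(c-1) is an assumed normalization."""
    c = len(memberships)
    gamma_c = c / (c - 1.0)
    return m0 * gamma_c * (1.0 - beta(memberships)) + delta

m0, delta = 8.0, 1.3
m_fuzzy = data_dependent_fuzzifier([1.0 / 3, 1.0 / 3, 1.0 / 3], m0, delta)
m_crisp = data_dependent_fuzzifier([1.0, 0.0, 0.0], m0, delta)
m_mixed = data_dependent_fuzzifier([0.7, 0.2, 0.1], m0, delta)
```

A pattern with equal memberships keeps the largest fuzzifier $m_0 + \delta$, a pattern fully committed to one cluster obtains the smallest value $\delta$, and intermediate cases fall in between.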

Simple runs of our solution and its comparison to that adopted in
↓FLVQ [2] are outlined in the next section.


4  PRELIMINARY  EXPERIMENTS 


We have used the IRIS data of Anderson and Fisher¹. It consists of three
subspecies, each containing 50 samples. The IRIS data set is the most widely
used in experiments with clustering algorithms [6], [5], [2], [8]. Indeed, it
has been widely used by researchers in clustering since 1936 [8], mainly
because it presents important overlaps between subspecies (subgroups).

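Unlike ↓FLVQ [2], in our solution the fuzzifier does not take the same
value for all input vectors. Hence, (3) and (4) change respectively to:

$$u_{ik} = \left( \sum_{j=1}^{c} \left( \frac{\| x_k - \nu_i \|_A}{\| x_k - \nu_j \|_A} \right)^{\frac{2}{m_k - 1}} \right)^{-1} \quad \forall i, k; \qquad (9)$$

$$\nu_i = \frac{\sum_{k=1}^{n} (u_{ik})^{m_k} \, x_k}{\sum_{k=1}^{n} (u_{ik})^{m_k}} \quad \forall i. \qquad (10)$$

where $m_k$ is computed by (6). For the sake of clarity, we denote by Data
Dependent ↓FLVQ (DD-↓FLVQ, for short) the ↓FLVQ based on (9) and (10).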

For a fair comparison between DD-↓FLVQ and ↓FLVQ, we have used the
same parameters. We based our comparison upon two criteria: the number
of steps needed by the algorithm to converge and the misclassification rate.
Indeed, these two criteria are the most commonly used ones for a fine
comparison between competing designs [2], [5]. For the sake of simplicity, the
matrix $A$ used by (9) was set to the identity for both algorithms.

An initial tracking of class prototypes can be achieved using many techniques;
e.g. unsupervised learning [6] or a random process [5], [10]. For our
purposes, we adopt the same method as used in [2]. Due to the space limit,
this initialization process will not be described herein. The initial and the
real prototypes of the IRIS subspecies are given in TABLES 1 and 2,
respectively.


$\nu_{1,0}$ = ( 4.30   2.00   1.00   1.00 )
$\nu_{2,0}$ = ( 6.10   3.20   3.95   1.30 )
$\nu_{3,0}$ = ( 7.90   4.40   6.90   2.50 )

TABLE  1:  INITIAL  GUESS  OF  CLASS  PROTOTYPES.

Simple runs of both algorithms with different settings of their parameters
are given in TABLE 3, where “★” means that the algorithm does not
converge within the maximal number of allowed iterations (= 200).


¹ IRIS data set is courtesy of Dr. James M. Keller.

$\nu_1$ = ( 5.00   3.43   1.46   0.25 )
$\nu_2$ = ( 5.94   2.77   4.26   1.33 )
$\nu_3$ = ( 6.59   2.97   5.55   2.03 )

TABLE  2:  REAL  PROTOTYPES  OF  THE  THREE  IRIS  SUBSPECIES.


              Final prototypes (I)      Final prototypes (II)     Iter.       Misc.
Parameters    ν1      ν2      ν3        ν1      ν2      ν3        I     II    I     II

m0 = 8.0      5.01    5.98    6.55      5.05    5.97    6.54
ε  = 0.01     3.40    2.84    3.00      3.42    2.86    3.00      09    16    13    11
δ  = 1.3      1.50    4.44    5.36      1.45    4.42    5.36
Δm = 0.04     0.25    1.43    1.95      0.22    1.43    1.96

m0 = 10.0     5.02    6.00    6.53      5.01    5.99    6.53
ε  = 0.01     3.40    2.85    2.99      3.40    2.87    3.00      10    14    12    11
δ  = 1.3      1.50    4.45    5.32      1.49    4.46    5.32
Δm = 0.02     0.24    1.43    1.93      0.21    1.44    1.94

m0 = 7.0                                5.04    6.00    6.51
ε  = 0.001            ★                 3.45    2.88    3.00      ★     31    ★     13
δ  = 1.1                                1.45    4.49    5.27
Δm = 0.03                               0.23    1.47    1.98

m0 = 5.0      5.01    5.95    6.62      4.98    5.77    6.60
ε  = 0.01     3.40    2.82    3.02      3.37    2.84    3.02      07    16    15    12
δ  = 1.1      1.50    4.41    5.45      1.49    4.25    5.49
Δm = 0.01     0.25    1.41    2.00      0.25    1.33    2.04

TABLE  3:  SIMPLE  RUNS  OF  THE  ↓FLVQ  (I)  AND  THE  DD-↓FLVQ  (II)  WITH
VARIOUS  SETTINGS  OF  THEIR  PARAMETERS.

In order to compute the misclassification rate, the same training samples
are classified using the prototypes found by both algorithms. For this, the
One-Nearest-Prototype method [2] (1-NP, for short) is adopted.
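The nearest-prototype rule used for counting misclassified patterns can be sketched as follows. This is an illustration under assumptions: Euclidean distance is used, prototype $i$ is assumed to represent class $i$, the prototypes are the real class means listed in TABLE 2, and the four labeled samples are made up for the example rather than taken from the experiment. The function names are hypothetical.

```python
def nearest_prototype(x, prototypes):
    """Index of the prototype closest to x (squared Euclidean distance)."""
    return min(range(len(prototypes)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(x, prototypes[i])))

def count_misclassified(samples, labels, prototypes):
    """Number of samples whose nearest prototype differs from their label."""
    return sum(1 for x, y in zip(samples, labels)
               if nearest_prototype(x, prototypes) != y)

# Real prototypes of the three subspecies (TABLE 2).
prototypes = [[5.00, 3.43, 1.46, 0.25],
              [5.94, 2.77, 4.26, 1.33],
              [6.59, 2.97, 5.55, 2.03]]

# Hypothetical labeled samples; the last one lies in the overlap region.
samples = [[5.1, 3.5, 1.4, 0.2],
           [6.0, 2.9, 4.5, 1.5],
           [6.3, 3.3, 6.0, 2.5],
           [6.0, 3.0, 4.9, 1.8]]
labels = [0, 1, 2, 2]

errors = count_misclassified(samples, labels, prototypes)
```

The last sample is labeled as class 2 but lies closer to the prototype of class 1, which is the kind of overlap that produces the misclassification counts.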

From TABLE 3, one can notice that DD-↓FLVQ is more stable with respect to
changes of its parameters ($\varepsilon$, $m_0$, $\delta$) than ↓FLVQ is with
respect to its own ($\varepsilon$, $m_0$, $\Delta m$). Furthermore, from
TABLES 2 and 3 one can notice that DD-↓FLVQ approximates the real centroids
better than ↓FLVQ does. This explains the fact that the number of patterns
misclassified by DD-↓FLVQ is smaller than that by ↓FLVQ. However, DD-↓FLVQ in
general requires a greater number of iterations to converge than ↓FLVQ.

Our experimental results² have shown that our solution performs better
clustering than ↓FLVQ. The difference in performance is significant enough
to encourage further investigations in the suggested direction.

² Experimental results can be provided by the authors upon request.

5  DISCUSSION  AND  FUTURE  WORK


In the current paper, we have developed a rough solution that solves (to some
extent) the problem of the choice of the fuzziness parameter in fuzzy
clustering algorithms. We tried to make the fuzzifier depend on the data set to
be clustered. Experimental results have shown that our solution performs
clustering as good as that reported in [2]; in addition, it has a better
approximation property.

We do not claim that our solution is better than that presented in [2]
in all cases. It may be possible to find data sets on which one solution
performs better clustering than the other. Our central concern is to assess
closer interactions between the fuzzifier and the input data. Although our
solution is still heuristic, the experimental results reported here show its
validity and usefulness, which encourages further investigations in the same
direction. Embedding more knowledge about the data in the computation of
the fuzzifier (e.g. cluster width, fuzzy hypervolume [6]) should yield better
results. For this, an analysis of the effect of such parameters on the
fuzzifier should be done. Such an analysis is not straightforward. For the time
being, we are focusing our attention on the analysis of a scheme analogous to
that presented in (6), embedding the fuzzy hypervolume criterion. Further
studies as well as simulation results are to be included in a forthcoming
paper.

Abstract. Neural network minimization problems are often ill-conditioned,
and in this contribution two ways to handle this are discussed.

It is shown that a better conditioned minimization problem can
be obtained if the problem is separated with respect to the linear
parameters. This will increase the convergence speed of the
minimization.

The Levenberg-Marquardt minimization method is often concluded
to perform better than the Gauss-Newton and the steepest
descent methods on neural network minimization problems. The
reason for this is investigated and it is shown that the
Levenberg-Marquardt method divides the parameters into two subsets. For
one subset the convergence is almost quadratic like that of the
Gauss-Newton method, and on the other subset the parameters
hardly converge at all. In this way a fast convergence among the
important parameters is obtained.

1.  INTRODUCTION 

This contribution addresses the criterion minimization to obtain the parameter
estimate in a model. Two slightly different topics are covered.

It is shown that a better conditioned minimization problem can be obtained
if the problem is separated with respect to the linear parameters. The
parameters are divided into two sets depending on whether the model is linear
or nonlinear with respect to the parameter. The iterative minimization can then
be done over the set of nonlinear parameters instead of over all parameters.
The better conditioned problem is likely to converge faster than the original
one. More aspects and an overview of separable minimization problems
can be found in [5].

Many different studies have been done where different minimization algorithms
are compared. See, e.g., [8, 3] and further references there. In
this contribution it is pointed out why the Levenberg-Marquardt algorithm
often is concluded to be the best one. The reason for this is connected to
the ill-conditioning, which often occurs in neural net minimization problems,
see [6].

In  Section  2  a  short  background  and  problem  formulation  is  given  and  in 
Section  3  the  improved  conditioning  of  separable  minimization  problems  is 
shown.  Section  4  concerns  the  Levenberg-Marquardt  algorithm  and  Section  5 
concludes  the  paper. 

2.  PROBLEM  FORMULATION 

Consider the following fitting problem. Given are $N$ data, each of which
consists of an output $y(t)$ and an input regressor $\varphi(t)$. For
simplicity it will be assumed that the output is one-dimensional. All the
results can easily be modified to cover the multi-output case.

It is assumed that there exists a function $f(\cdot)$ so that the data can be
described as

$$y(t) = f(\varphi(t)) + e(t)$$

where $e(t)$ can be described by a white noise sequence. The function
$f(\cdot)$ is unknown and the goal is to use the data to obtain an estimate of
it. This is done by proposing a model $g$ with adjustable parameters

$$\hat{y}(t) = g(\theta, \varphi(t)) = \sum_{k=1}^{n} c_k \, g_k(\varphi(t), a_k) \qquad (1)$$

where $\hat{y}(t)$ is the prediction of $y(t)$ and $n$ is the number of basis
functions $g_k$. The parameters $c_k$, $a_k$ are put into a common parameter
vector $\theta = [c^T \; a^T]^T$. Depending on the choice of the basis
functions $g_k$, different well-known nonlinear model structures like radial
basis functions and feed-forward neural nets are obtained. The dimension of the
parameter $a_k$ also depends on the particular choice of basis functions. For
example, if all basis functions are chosen as identical sigmoids, i.e.,
$g_k(\varphi(t), a_k) = \sigma(a_{k1}^T \varphi(t) + a_{k0})$, then (1)
becomes a feed-forward neural network and the number of neurons is given
by $n$. The second index added to the parameters indicates the parameters
connected to the regressor, $a_{j1}$, and the parameter connected to the
position of the sigmoid, $a_{j0}$.

Given a model structure $g$ and an estimation data set, the parameter estimate
$\hat{\theta}_N$ is defined as the minimum of a criterion of fit, e.g., the sum
of squared errors

$$\hat{\theta}_N = \arg\min_{\theta} V_N(\theta) \qquad (2)$$

where

$$V_N(\theta) \propto \sum_{t=1}^{N} e^2(\theta, t) \qquad (3)$$

and $e(\theta, t) = y(t) - \hat{y}(\theta, t)$.

The paper concerns possible improvements in how to compute the minimum
of $V_N(\theta)$, i.e., how to compute the parameter estimate (2). Before we come to
the improvements, a more compact, vectorized notation will be introduced
which makes the calculations easier to follow.

The following notation gives a compact way to handle the whole data set
at the same time: $\varepsilon = [\varepsilon(\theta,1), \varepsilon(\theta,2), \ldots, \varepsilon(\theta,N)]^T$. Define $y$, $\hat y$, and $\varphi$ in
the analogous way. Introduce a matrix containing all the basis functions at
all time instants, $g = [g_1, \ldots, g_n]$.

Note that the model structure (1) is linear in the parameters $c$ but
nonlinear in the parameters $\alpha$. This feature will be exploited in the following
section to derive an improved algorithm to compute the estimate $\hat\theta_N$.

2.1.  Iterative  Minimization  Scheme 

Here follows a short background on minimization, which forms the basis for
the discussions in the following two sections. For a thorough treatment of the
topic see, e.g., [1].

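To compute $\hat\theta_N$ in (2) one typically uses an iterative gradient-based algorithm
of the following type

$$\hat\theta^{(i+1)} = \hat\theta^{(i)} - \mu_i d_i \qquad (4)$$

where $\mu_i$ is a step length chosen to guarantee a decrease of the criterion (3) in each
iteration.¹ The step direction $d_i$ is given by

$$d_i = R_i^{-1} \nabla V_N(\hat\theta^{(i)}) \qquad (5)$$

where $R_i$ is a matrix which modifies the search direction from the steepest
descent direction

$$\nabla V_N(\hat\theta^{(i)}) = \varepsilon'^{T} \varepsilon \qquad (6)$$

to a more favorable direction. Here $\varepsilon'$ denotes the derivative of $\varepsilon$ with respect to $\theta$.

Starting from an initial parameter value $\hat\theta^{(0)}$, equation (4) is iterated
until $\hat\theta^{(i)}$ converges. Depending on the choice of $R_i$, different minimization
schemes are obtained. If $R_i$ is chosen as the Hessian of the criterion,

$$R_i = V_N''(\hat\theta^{(i)}) = \varepsilon'^{T} \varepsilon' + \sum_{t=1}^{N} \varepsilon(\hat\theta^{(i)}, t)\, \varepsilon''(\hat\theta^{(i)}, t) \qquad (7)$$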

then one has the Newton method. The second term in (7) will typically be
small compared to the first term and, since it is much more computationally
expensive to compute, it is often neglected. Also, this gives an $R_i$ that is positive
semi-definite, which is necessary for the algorithm to converge. The following
alternative search directions are more common in practice:

- Gradient direction. Simply take $R_i = I$.

- Gauss-Newton direction. Use $R_i = \varepsilon'^{T} \varepsilon'$.

- Levenberg-Marquardt direction. Use $R_i = \varepsilon'^{T} \varepsilon' + \delta_i I$, where $\delta_i$ is used
instead of the step size $\mu_i$.

¹ Typically one starts with $\mu_i = 1$ and tests if $V_N(\hat\theta^{(i+1)}) < V_N(\hat\theta^{(i)})$. If that is not
the case, $\mu_i$ is decreased and a new $\hat\theta^{(i+1)}$ is computed. This process continues until a
downward step is obtained.

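For the Gauss-Newton method the computation of the step direction $d_i$ can
be formulated as a linear quadratic minimization problem, which has to be
solved in each iteration of (4)

$$d_i = \arg\min_{d_i} \| \varepsilon' d_i - \varepsilon \|^2 = (\varepsilon')^{+} \varepsilon = (\varepsilon'^{T} \varepsilon')^{-1} \varepsilon'^{T} \varepsilon \qquad (8)$$

from which the definition of the pseudo-inverse $(\varepsilon')^{+}$ follows. Similarly, in
each iteration of the Levenberg-Marquardt method the following minimization
problem has to be solved

$$d_i = \arg\min_{d_i} \| \varepsilon' d_i - \varepsilon \|^2 + \delta_i \| d_i \|^2 = \begin{bmatrix} \varepsilon' \\ \sqrt{\delta_i}\, I \end{bmatrix}^{+} \begin{bmatrix} \varepsilon \\ 0 \end{bmatrix} \qquad (9)$$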

Remark 1 The linear quadratic minimization problems (8) and (9) can be
solved using QR-factorization, which is computationally much more efficient
than computing the pseudo-inverses of $\varepsilon'$ and $[\varepsilon'^{T}\ \sqrt{\delta_i}\, I]^{T}$, see [2].
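
As an illustration of Remark 1, the sketch below solves (8) and (9) through a reduced QR factorization and compares the result with the explicit pseudo-inverse and normal-equation formulas. The random matrix `J` merely stands in for $\varepsilon'$, and the helper name `solve_by_qr` and the value of `delta` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
J = rng.standard_normal((20, 4))    # stands in for the derivative eps' (N x d)
eps = rng.standard_normal(20)       # stands in for the residual vector
delta = 0.1                         # illustrative Levenberg-Marquardt parameter

def solve_by_qr(A, b):
    """Least-squares solution of A d = b via a reduced QR factorization."""
    Q, R = np.linalg.qr(A)          # A = Q R with R square and upper triangular
    return np.linalg.solve(R, Q.T @ b)

# Gauss-Newton direction, eq. (8)
d_gn = solve_by_qr(J, eps)

# Levenberg-Marquardt direction, eq. (9): stack sqrt(delta) * I under J
A_aug = np.vstack([J, np.sqrt(delta) * np.eye(J.shape[1])])
b_aug = np.concatenate([eps, np.zeros(J.shape[1])])
d_lm = solve_by_qr(A_aug, b_aug)

# Reference solutions from the explicit formulas
d_gn_ref = np.linalg.pinv(J) @ eps
d_lm_ref = np.linalg.solve(J.T @ J + delta * np.eye(J.shape[1]), J.T @ eps)
```

The stacked formulation never forms $\varepsilon'^{T}\varepsilon'$ explicitly, which avoids squaring the condition number of the problem.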

The  following  definitions  will  be  needed  in  the  sequel. 

Definition 1

The condition number of a matrix $A$ is defined as $\mathrm{cond}(A) = \max \sigma(A) / \min \sigma(A)$, where
$\sigma(A)$ are the singular values of $A$. If $\mathrm{cond}(A)$ is very large (e.g. > 10^), $A$ is
said to be ill-conditioned.


Definition 2

An ill-conditioned minimization problem is a minimization problem where $\varepsilon'$
is ill-conditioned.

Ill-conditioned minimization problems are often troublesome to minimize,
and the iterative search (4) usually converges much more slowly than for better
conditioned minimization problems. Neural network minimization problems
are often very ill-conditioned (see, e.g., [6]) and that motivates the research
presented in the following two sections.

3.  SEPARABLE  MINIMIZATION  PROBLEMS 

There are two kinds of parameters in (1). The model is separable with respect
to the parameters $\{c_k\}$, i.e., given the parameters $\{\alpha_k\}$, the $\{c_k\}$ can be
calculated exactly by the least squares method, without iterative search. The
idea of making use of the separability feature is not new; a good overview is
[5] and further references can be found therein.

However, it has, to the best of the authors' knowledge, not been applied to
neural nets before. The derivation here deviates slightly from the one in [5], and
in particular it will be shown that the separated problem is better conditioned
than the original minimization problem. This means that the separated
algorithm can be expected to converge in fewer iterations, which motivates the
method. In [5] it is shown that the computational burden per iteration is of
the same order for the separated and the non-separated algorithm. Hence,
the use of the algorithm can only be motivated by the iterations becoming
more efficient, in the sense that a lower number of iterations will be needed
to minimize the criterion (3).

It is possible to combine the separation method with any of the
minimization methods mentioned in the previous section, and this will be done on an
example to visualize the obtained improvement.

The main idea is simple: the estimate of the separable parameters is
described by

$$\hat c(\alpha) = g^{+} y. \qquad (10)$$

This expression can be substituted into the model (1). Since the parameters
$c$ have been eliminated, this gives a new parameterization of the model with
fewer parameters than the original problem. We will now look into the details.

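Divide the parameter update and the derivative with respect to the two
subsets of parameters, $d_i = [d_i^{cT}\ d_i^{\alpha T}]^T$ and $\varepsilon' = [\varepsilon_c\ \varepsilon_\alpha]$, where the derivative
sign $'$ has been suppressed for the sake of clarity.

Introduce the projection $P_c$ onto the column space of $\varepsilon_c$

$$P_c = \varepsilon_c (\varepsilon_c^T \varepsilon_c)^{-1} \varepsilon_c^T$$

and the complementary projection $Q_c = I - P_c$ onto the kernel of $\varepsilon_c^T$. Since
$P_c + Q_c = I$ the norm in (8) can be re-written in the following way

$$\begin{aligned}
\| \varepsilon' d_i - \varepsilon \|^2 &= \| (P_c + Q_c)[\varepsilon_c\ \varepsilon_\alpha] d_i - (P_c + Q_c)\varepsilon \|^2 \\
&= \| (Q_c \varepsilon_\alpha d_i^{\alpha} - Q_c \varepsilon) + (\varepsilon_c d_i^{c} + P_c \varepsilon_\alpha d_i^{\alpha} - P_c \varepsilon) \|^2 \\
&= \| Q_c \varepsilon_\alpha d_i^{\alpha} - Q_c \varepsilon \|^2 + \| \varepsilon_c d_i^{c} + P_c \varepsilon_\alpha d_i^{\alpha} - P_c \varepsilon \|^2
\end{aligned} \qquad (11)$$

In the second step $Q_c \varepsilon_c = 0$ and $P_c \varepsilon_c = \varepsilon_c$ were used, and the last equality
follows from the fact that $P_c Q_c = 0$. The first term of (11) can be minimized
independently of the second term, and instead of minimizing the second term
one can use (10) to obtain $\hat c$. Hence, if the Gauss-Newton method is applied to
the separated problem the parameter update direction becomes

$$d_i^{\alpha} = \arg\min_{d_i^{\alpha}} \| Q_c \varepsilon_\alpha d_i^{\alpha} - Q_c \varepsilon \|^2 = (Q_c \varepsilon_\alpha)^{+} Q_c \varepsilon. \qquad (12)$$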

If  the  Levenberg-Marquardt  algorithm  is  preferred  one  has  to  modify  this 
expression  in  analogy  with  (8)  and  (9). 
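
The orthogonal splitting of the norm and the resulting separated direction can be checked numerically. In the sketch below random matrices `E_c` and `E_a` stand in for the two blocks of the derivative; all variable names are assumptions for the purpose of illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n_c, n_a = 30, 3, 5
E_c = rng.standard_normal((N, n_c))   # stands in for the block eps_c
E_a = rng.standard_normal((N, n_a))   # stands in for the block eps_alpha
eps = rng.standard_normal(N)
d_c = rng.standard_normal(n_c)        # arbitrary trial directions
d_a = rng.standard_normal(n_a)

P_c = E_c @ np.linalg.pinv(E_c)       # projection onto the column space of eps_c
Q_c = np.eye(N) - P_c                 # complementary projection

# Left-hand side and the two orthogonal terms of the split norm
lhs = np.sum((E_c @ d_c + E_a @ d_a - eps) ** 2)
term1 = np.sum((Q_c @ (E_a @ d_a - eps)) ** 2)
term2 = np.sum((E_c @ d_c + P_c @ (E_a @ d_a - eps)) ** 2)

# Separated Gauss-Newton direction versus the full Gauss-Newton direction
d_a_sep = np.linalg.pinv(Q_c @ E_a) @ (Q_c @ eps)
d_full = np.linalg.pinv(np.hstack([E_c, E_a])) @ eps
```

In this linearized setting the separated direction coincides with the $\alpha$-part of the full Gauss-Newton direction; the difference between the two algorithms comes from re-solving for $c$ exactly after each step.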

Given an initial estimate $\hat\alpha^{(0)}$, the following algorithm applies.

Algorithm 1 Separable Gauss-Newton minimization algorithm

1. Compute $\hat c(\hat\alpha^{(0)})$ with (10).

2. Compute $\varepsilon_\alpha$ and $\varepsilon_c$ and form $Q_c$.

3. Compute $d_i^{\alpha}$ according to (12) and compute a candidate $\hat\alpha^{(i+1)}$ with a step
length $\mu_i = 1$.

4. Compute $\hat c(\hat\alpha^{(i+1)})$ with (10) and check that $V_N(\hat\theta^{(i+1)}) < V_N(\hat\theta^{(i)})$. If not,
repeat from 2 with a smaller $\mu_i$.

5. If $\hat\alpha^{(i)}$ has not converged, repeat from 2.

To obtain a separable Levenberg-Marquardt algorithm, steps 3 and 4 have
to be modified in analogy with (8) and (9).
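
A minimal sketch of the separable Gauss-Newton iteration is given below, assuming tanh basis functions, a criterion of half the sum of squared residuals, and step halving for the step length. The function names, the fixed iteration count and the stopping threshold on the step length are assumptions of this example, not details taken from the paper.

```python
import numpy as np

def basis(phi, A):
    """N x n matrix of tanh units; A is (n, m + 1), last column is the offset."""
    return np.tanh(phi @ A[:, :-1].T + A[:, -1])

def criterion(phi, y, A):
    """Solve for the linear parameters by least squares and return V, c, G."""
    G = basis(phi, A)
    c = np.linalg.lstsq(G, y, rcond=None)[0]
    return 0.5 * np.sum((y - G @ c) ** 2), c, G

def separable_gauss_newton(phi, y, A, n_iter=20):
    N = phi.shape[0]
    n = A.shape[0]
    phi1 = np.hstack([phi, np.ones((N, 1))])    # regressor plus constant for the offset
    V, c, G = criterion(phi, y, A)
    history = [V]
    for _ in range(n_iter):
        eps = y - G @ c
        dG = 1.0 - G ** 2                        # derivative of tanh
        # Derivative of eps w.r.t. the entries of A, in row-major order
        E_a = -np.concatenate(
            [(c[k] * dG[:, k])[:, None] * phi1 for k in range(n)], axis=1)
        E_c = -G                                 # derivative of eps w.r.t. c
        Q_c = np.eye(N) - E_c @ np.linalg.pinv(E_c)
        d_a = np.linalg.lstsq(Q_c @ E_a, Q_c @ eps, rcond=None)[0]
        mu = 1.0
        while mu > 1e-8:                         # halve the step until V decreases
            A_new = A - mu * d_a.reshape(A.shape)
            V_new, c_new, G_new = criterion(phi, y, A_new)
            if V_new < V:
                break
            mu *= 0.5
        else:
            break                                # no downhill step found: stop
        A, V, c, G = A_new, V_new, c_new, G_new
        history.append(V)
    return A, c, history

rng = np.random.default_rng(0)
phi = rng.standard_normal((100, 2))
A_true = rng.standard_normal((3, 3))
y = basis(phi, A_true) @ np.array([1.0, -2.0, 0.5])
A_hat, c_hat, history = separable_gauss_newton(phi, y, rng.standard_normal((3, 3)))
```

Because a step is only accepted when it lowers the criterion, `history` is decreasing by construction; how quickly it decreases depends on the data and the starting point.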

The  following  theorem  shows  that  the  separated  minimization  problem 
becomes  better  conditioned  than  the  original  one. 

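Theorem 1 The separated minimization problem is at least as well
conditioned as the non-separated minimization problem.

Proof: It is to be shown that

$$\mathrm{cond}(Q_c \varepsilon_\alpha) \le \mathrm{cond}([\varepsilon_c\ \varepsilon_\alpha]). \qquad (13)$$

To do that we will make use of the following easily shown facts. Let $A$ be an
$n \times n$ matrix. A reduced matrix $\bar A$ is obtained from $A$ by deleting one row
and one column from $A$. Then $\max(\sigma(A)) \ge \max(\sigma(\bar A))$ and $\min(\sigma(A)) \le
\min(\sigma(\bar A))$, where $\sigma(\cdot)$ are the singular values of the matrix.

1. From Definition 1 it then follows that $\mathrm{cond}(\bar A) \le \mathrm{cond}(A)$.

2. If $A$ is invertible, $\mathrm{cond}(A) = \mathrm{cond}(A^{-1})$.

3. If $A$ is $n \times m$ and $n > m$ then $\mathrm{cond}(A^T A) = \mathrm{cond}(A)^2$.

Using these facts one has

$$\begin{aligned}
\mathrm{cond}([\varepsilon_c\ \varepsilon_\alpha])^2
&= \mathrm{cond}\left( \begin{bmatrix} \varepsilon_c^T \varepsilon_c & \varepsilon_c^T \varepsilon_\alpha \\ \varepsilon_\alpha^T \varepsilon_c & \varepsilon_\alpha^T \varepsilon_\alpha \end{bmatrix} \right)
= \mathrm{cond}\left( \begin{bmatrix} A & B \\ C & (\varepsilon_\alpha^T \varepsilon_\alpha - \varepsilon_\alpha^T P_c \varepsilon_\alpha)^{-1} \end{bmatrix} \right) \\
&\ge \mathrm{cond}(\varepsilon_\alpha^T \varepsilon_\alpha - \varepsilon_\alpha^T P_c \varepsilon_\alpha)
= \mathrm{cond}(\varepsilon_\alpha^T Q_c Q_c \varepsilon_\alpha)
= \mathrm{cond}(Q_c \varepsilon_\alpha)^2.
\end{aligned} \qquad (14)$$

In the second step the matrix is inverted; $A$, $B$, and $C$ are sub-matrices
without interest in this context and they are removed in the third step, giving
the inequality. The claim (13) follows directly from (14). □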
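
The statement of Theorem 1 can be illustrated numerically. In this sketch the columns of `E_a` are deliberately made nearly dependent on those of `E_c`, an assumed setup chosen to produce an ill-conditioned full problem; the sizes and the perturbation level are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
N, n_c, n_a = 50, 4, 6
E_c = rng.standard_normal((N, n_c))               # stands in for eps_c
# Columns of E_a lie almost in the column space of E_c
E_a = E_c @ rng.standard_normal((n_c, n_a)) + 1e-3 * rng.standard_normal((N, n_a))

Q_c = np.eye(N) - E_c @ np.linalg.pinv(E_c)

# np.linalg.cond defaults to the ratio of largest to smallest singular value
cond_full = np.linalg.cond(np.hstack([E_c, E_a]))
cond_sep = np.linalg.cond(Q_c @ E_a)
print(f"cond full: {cond_full:.3g}, cond separated: {cond_sep:.3g}")
```

The projection removes exactly the part of `E_a` that is explained by `E_c`, so in this construction the separated matrix is far better conditioned than the full one.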

As  mentioned  in  the  beginning  of  the  section  it  is  shown  in  [5]  that  the 
computational  burden  of  a  single  iteration  is  not  changed  due  to  using  the 
separated  Algorithm  1.  Instead,  Theorem  1  gives  the  benefit  of  the  separation 
method:  a  better  conditioned  minimization  problem. 
3.1. Example

A data set of 300 samples is obtained in the following way. First 300 input
data with $\dim \varphi = 5$ are generated from a Gaussian distribution
with unit variance. Then the output data are obtained with a one
hidden-layer feed-forward neural net with 5 inputs and 1 output. The
parameters $\alpha_k$ are chosen randomly from a Gaussian distribution with variance 2
and $c_k$ from a Gaussian distribution with variance 6. The data are then used
to estimate a model of the same kind as the one which generated the data.
The initial parameter values of the model are generated in the same way as
described above. From the initial parameter values the criterion is iteratively
minimized using the normal Gauss-Newton method and the separated Gauss-Newton
method, Algorithm 1. The result is depicted in Figure 1. It is clear from the
figure that the separated version succeeds much better. The standard
Gauss-Newton method terminates after only a few iterations. The reason for this is the
ill-conditioning of the minimization problem, which will be explained in the
following section.

Figure 1: Criterion (3) as a function of the number of iterations of a Gauss-Newton
search. Solid line: Standard Gauss-Newton method. Dashed line: Gauss-Newton
applied to the separated problem.

4.  CONVERGENCE  SPEED  OF  MINIMIZATION  ALGORITHMS 

In this section an explanation will be given of why the Levenberg-Marquardt
algorithm often performs best on neural network minimization problems.
Again it has to do with the ill-conditioning which has already been discussed.

It is well known that steepest descent search for the minimum is inefficient,
especially for ill-conditioned problems close to the minimum. In such cases it
is usually expected that the Newton or Gauss-Newton method will perform
better; see, e.g., [1]. For these methods the matrix $R_i$ in (5) transforms the
search direction so that all parameters become equally important. This is
illustrated in Figure 2, where the level curves of $V_N(\theta)$ are shown. Figure 2
a) corresponds to an ill-conditioned problem and the valley is a direction in
the parameter space which has little influence on $V_N(\theta)$. The steepest descent
method needs many iterations along the valley before the minimum is reached.
The Gauss-Newton method, on the other hand, transforms $V_N(\theta)$ so that all
directions become equally important, as depicted in Figure 2 b), and then the
convergence is typically much faster. Since the Gauss-Newton method relies on
a quadratic Taylor expansion of $V_N(\theta)$ at $\hat\theta^{(i)}$, the gain in convergence depends
on how close $V_N(\theta)$ is to being quadratic. It is clear that the quadratic Taylor
expansion is good only in a neighborhood of the current estimate $\hat\theta^{(i)}$, and
too large a parameter step brings the estimate outside this region. Let us
now look at what the parameter steps look like with the different methods.

Figure 2: Level curves of the criterion $V_N(\theta)$, a) of an ill-conditioned problem, b)
as seen by the Gauss-Newton method.

Since $R_i = \varepsilon'^{T} \varepsilon'$ is symmetric it can be written $R_i = T_i Q_i T_i^T$ where
the singular values, $\{q_k\}$, of $R_i$ form the diagonal of the diagonal matrix $Q_i$ and
$T_i$ consists of the eigenvectors. The parameter update direction (5) for the
Levenberg-Marquardt method then becomes

$$d_i = (R_i + \delta I)^{-1} \nabla V_N(\hat\theta^{(i)}) = T_i\, \mathrm{diag}\!\left( \frac{1}{q_1 + \delta}, \ldots, \frac{1}{q_d + \delta} \right) T_i^T\, \nabla V_N(\hat\theta^{(i)}) \qquad (15)$$

where $d$ is the number of parameters. The corresponding update direction
for the Gauss-Newton method is obtained by setting $\delta = 0$ in (15).
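
The eigen-decomposition form of the Levenberg-Marquardt direction can be verified with a small numerical sketch. The matrix `J` below stands in for $\varepsilon'$ and is scaled so that one direction is weak; `g` stands in for the gradient. Both, as well as the value of `delta`, are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
# Third column is scaled down to create one direction with a tiny eigenvalue
J = rng.standard_normal((40, 3)) * np.array([1.0, 1.0, 1e-3])
g = rng.standard_normal(3)        # stands in for the gradient of V_N
delta = 1e-2

R = J.T @ J
q, T = np.linalg.eigh(R)          # R = T diag(q) T^T, eigenvalues in ascending order

gain_gn = 1.0 / q                 # per-direction gain of Gauss-Newton
gain_lm = 1.0 / (q + delta)       # per-direction gain of Levenberg-Marquardt

d_lm = T @ (gain_lm * (T.T @ g))  # direction built from the eigen-decomposition
d_ref = np.linalg.solve(R + delta * np.eye(3), g)
```

For the weak direction `gain_gn[0]` is several orders of magnitude larger than `gain_lm[0]`, which stays just below `1 / delta`; for the strongest direction the two gains nearly coincide.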

Given a $\delta > 0$, for those directions corresponding to small eigenvalues,
$\delta \gg q_k$, Gauss-Newton gives a large step size $1/q_k$. Levenberg-Marquardt
on the other hand gives a small step of size $1/(q_k + \delta) \approx 1/\delta$. For those
directions corresponding to large eigenvalues, $\delta \ll q_k$, Levenberg-Marquardt
gives a step size $1/(q_k + \delta) \approx 1/q_k$, approximately equal to the Gauss-Newton
step. In this way Levenberg-Marquardt divides the parameter directions into
two sub-classes. Within the first class one has an efficient convergence of
“Gauss-Newton type”, and within the second class one has a slowly converging
steepest descent method with step length $\mu_i = 1/\delta$. In this way one
can consider the Levenberg-Marquardt method to be “in between” the
Gauss-Newton method and the steepest descent method. What is the advantage of
excluding some of the directions from the efficient fit?

There are two reasons to exclude the directions corresponding to the small
singular values of $Q_i$ from the parameter update step, i.e., to exclude shallow
valleys like the one in Figure 2 a) from the fit. First, in a direction with small $q_k$
Gauss-Newton takes a large step $1/q_k$. This is likely to be a step outside the
region where the second order Taylor expansion holds. Hence, to obtain a
decrease in the criterion $V_N(\theta)$ the step length $\mu_i$ in (4) must be small. The
important parameters, corresponding to large $q_k$, will then also be changed
with the same small value $\mu_i$ instead of with (for them) the optimal one ($\mu_i = 1$).
This means that a very small step is taken in the direction where the criterion
decreases the most, and with a small step size one has to perform a lot of
iterations before the minimum is reached. It is clear that in such situations it
is advantageous to divide the parameters in the way the Levenberg-Marquardt
method does.

The second reason to exclude those parameter directions which only
influence the criterion $V_N(\theta)$ marginally has to do with the bias-variance trade-off.
Neural nets are often over-parameterized, which means that they contain more
parameters than necessary. The net does not benefit from the freedom which
some of the parameters give it. Call these parameters spurious, in contrast
to the useful parameters. It is the spurious parameters which give the
shallow valleys in the criterion function $V_N(\theta)$, and the spurious parameters will
typically only fit the noise in the data. It is hence advantageous to exclude
them from the fit. This can be done by using regularization or by stopping
the minimization before the minimum is reached. See, e.g., [7, 4].

It remains to decide upon the factor $\delta$ in (15). It is typically chosen by
trial and error. Given an initial value of $\delta$, if that value does not give a
decrease of the criterion then it has to be increased. On the other hand, if it
decreases the criterion, then a smaller value is tried in the following iteration.
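
The trial-and-error choice of $\delta$ can be sketched as follows. The factor of 10 used to increase and decrease $\delta$, the upper limit on $\delta$, the function name `levenberg_marquardt`, and the exponential test problem are all assumptions of this example rather than choices made in the paper.

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, theta, delta=1.0, n_iter=50):
    """Minimise 0.5 * ||residual(theta)||^2 with a simple delta schedule."""
    eps = residual(theta)
    V = 0.5 * eps @ eps
    for _ in range(n_iter):
        J = jacobian(theta)                       # derivative of the residual
        while delta < 1e12:
            d = np.linalg.solve(J.T @ J + delta * np.eye(theta.size), J.T @ eps)
            eps_new = residual(theta - d)
            V_new = 0.5 * eps_new @ eps_new
            if V_new < V:                         # decrease: accept, try smaller delta next
                theta, eps, V = theta - d, eps_new, V_new
                delta *= 0.1
                break
            delta *= 10.0                         # no decrease: increase delta and retry
        else:
            break                                 # no decreasing step could be found
    return theta, V

# Hypothetical test problem: fit a * exp(b * t) to noise-free data
t = np.linspace(0.0, 4.0, 25)
y = 2.0 * np.exp(-0.5 * t)
residual = lambda th: y - th[0] * np.exp(th[1] * t)
jacobian = lambda th: -np.column_stack(
    [np.exp(th[1] * t), th[0] * t * np.exp(th[1] * t)])

theta0 = np.array([1.0, 0.0])
V0 = 0.5 * np.sum(residual(theta0) ** 2)
theta_hat, V_hat = levenberg_marquardt(residual, jacobian, theta0)
```

A large $\delta$ makes the step short and close to the steepest descent direction, so a decreasing step can always be found unless the gradient vanishes.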

Similar arguments can be used to explain why the conjugate gradient
minimization method, and even the steepest descent method, can be better than
the Gauss-Newton method on ill-conditioned minimization problems.

4.1.  Example 

Let us continue with the example from the previous section. From the same
initial parameter point the Levenberg-Marquardt and the steepest descent
algorithms are applied, with the standard scheme and with the separated
Algorithm 1. The results are depicted in Figure 3. Note that the y-axes
are differently scaled. The results are strikingly in favor of the separated
Levenberg-Marquardt algorithm.

5.  CONCLUSIONS 

The following conclusions can be drawn about ill-conditioned neural net
minimization problems:

• The conditioning of neural network minimization problems can be
improved by separating the problem with respect to the linear parameters.
This increases the convergence rate of the minimization.

• The Levenberg-Marquardt algorithm typically converges faster than the
Gauss-Newton method because an efficient convergence is obtained
in the important parameter directions. At the same time the parameters
which do not influence the criterion substantially are held almost fixed.

Figure 3: Criterion (3) as a function of the number of iterations of a) a
Levenberg-Marquardt search, b) a steepest-descent search. Solid line: standard method.
Dashed line: the respective method applied to the separated problem.
Abstract - Training recurrent networks is generally believed to be a
difficult  task.  Excessive  training  times  and  lack  of  convergence  to  an 
acceptable  solution  are  frequently  reported.  In  this  paper  we  seek  to 
explain  the  reason  for  this  from  a  numerical  point  of  view  and  show 
how  to  avoid  problems  when  training.  In  particular  we  investigate  ill- 
conditioning,  the  need  for  and  effect  of  regularization  and  illustrate  the 
superiority  of  second-order  methods  for  training. 

INTRODUCTION 

Recurrent  neural  networks  are  an  interesting  class  of  models  for  signal  pro¬ 
cessing  as  they  are  able  to  build  up  internal  memory  suited  for  the  task  at 
hand  and  thus  often  lead  to  compact  model  representations.  However,  it  is 
generally believed to be a difficult task to train this type of network. Several
authors have addressed the learning problem for recurrent networks, e.g., in
the context of sequence classification when required to store information for
an arbitrary period of time [1, 5], but to the best of the authors' knowledge no
one has treated the problem from a general numerical point of view.

Feedforward  networks  were  treated  extensively  from  a  numerical  point 
of  view  in  [7]  where  it  was  illustrated  how  training  forms  an  extremely  ill- 
conditioned  optimization  problem.  In  this  contribution  we  extend  this  analy¬ 
sis  to  include  recurrent  networks.  In  particular  we  identify  redundant  connec¬ 
tions  and  illustrate  how  ill-conditioning  may  otherwise  arise,  which  motivates 
the  use  of  regularization. 

Acknowledging the need for regularization makes way for the highly
effective second-order methods for training. In this contribution we
particularly focus on the damped Gauss-Newton method and illustrate how this
method by far outperforms gradient descent on a time series prediction
problem, namely the Santa Fe laser data. The focus in this contribution is on
time series prediction, but the results generalize to other applications as well.

ARCHITECTURE 

The general architecture of the networks considered here is a fully connected
feedback network with one hidden layer of nonlinear units and a single linear
output unit. The output $y(t)$ of the network is linear in order to allow for
arbitrary dynamical range, and is given by

$$y(t) = \sum_{i=1}^{N_h} w_{oi} s_i(t) + w_{ob} \qquad (1)$$

where $N_h$ is the number of hidden units, $w_{oi}$ is the weight to the output unit
from hidden unit $i$ and $w_{ob}$ is a bias weight. The output $s_i(t)$ from hidden
unit $i$ at time $t$ is computed as

$$s_i(t) = f\left( \sum_{j=1}^{N_h} w_{ij} s_j(t-1) + w_{io} y(t-1) + \sum_{k=1}^{N_I} w_{ik} x_k(t) + w_{ib} \right) \qquad (2)$$

where $w_{ij}$ is the weight to hidden unit $i$ from hidden unit $j$, $w_{io}$ is the weight
to hidden unit $i$ from the output unit and $w_{ib}$ is the bias weight for hidden
unit $i$. $x_k(t)$ is the $k$'th element in the external input vector $\mathbf{x}(t)$ at time $t$
and $N_I$ is the total number of external inputs. $f(\cdot)$ is the nonlinear activation
function; in this work we use $f(x) = \tanh(x)$.

Note  that  the  update  of  the  recurrent  network  presented  above  is  layered, 
as  the  outputs  Si(t)  from  the  hidden  units  are  computed  immediately  before 
the  computation  of  the  output  unit  output.  This  is  opposed  to  the  update 
presented  in  e.g.  [10]  where  all  the  units  are  updated  simultaneously.  In  [6]  it 
was  shown  that  when  using  fully  recurrent  networks  for  forecasting,  layered 
update  is  preferable  since  synchronous  update  of  the  units  effectively  results 
in  a  two-step  ahead  predictor.  Note  also  that  the  linear  output  unit  does  not 
have  feedback  of  its  own  previous  value.  This  is  in  order  to  avoid  stability 
problems  that  are  otherwise  likely  to  occur. 
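
The layered update can be sketched as a short forward pass. The argument names are hypothetical, and starting from zero hidden states and a zero previous output is an assumption of this example.

```python
import numpy as np

def rnn_forward(x, W_hh, w_ho, W_hx, w_hb, w_oh, w_ob):
    """Layered update: hidden units first, then the linear output.

    x    : (T, N_I) external inputs
    W_hh : (N_h, N_h) hidden-to-hidden weights w_ij
    w_ho : (N_h,)     output-to-hidden feedback weights w_io
    W_hx : (N_h, N_I) input weights w_ik
    w_hb : (N_h,)     hidden bias weights w_ib
    w_oh : (N_h,)     hidden-to-output weights w_oi
    w_ob : float      output bias weight
    """
    T = x.shape[0]
    s = np.zeros(W_hh.shape[0])   # assumed start-up state s(0) = 0
    y_prev = 0.0                  # assumed start-up output y(0) = 0
    y = np.empty(T)
    for t in range(T):
        # Hidden units use the previous hidden states and the previous output
        s = np.tanh(W_hh @ s + w_ho * y_prev + W_hx @ x[t] + w_hb)
        # The linear output is computed immediately from the new hidden states
        y[t] = w_oh @ s + w_ob
        y_prev = y[t]
    return y

rng = np.random.default_rng(0)
N_h, N_I, T = 3, 2, 10
x = rng.standard_normal((T, N_I))
y = rnn_forward(x,
                0.5 * rng.standard_normal((N_h, N_h)),
                0.5 * rng.standard_normal(N_h),
                rng.standard_normal((N_h, N_I)),
                rng.standard_normal(N_h),
                rng.standard_normal(N_h),
                0.1)
```

Since each hidden output lies in (-1, 1), the linear output is bounded by the sum of the absolute output weights plus the bias, while the bias and output weights still set its range freely.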

Training 

In this work we focus on time series prediction, in which case the input vector
contains delayed elements of the time series, $\mathbf{x}(t) = [x(t), \ldots, x(t - N_I + 1)]$,
and the network output is a prediction of the next value in the series, $\hat x(t+1) =
y(t)$. Training the network means adjusting the weights so as to minimize a
cost function. Most applications are based on the sum of squared errors,

$$E(\mathbf{w}) = \frac{1}{2} \sum_{t=1}^{T} e^2(t), \qquad e(t) = x(t+1) - y(t) \qquad (3)$$

where $T$ denotes the number of training examples and $\mathbf{w}$ is the concatenated
set of parameters. The adjustment of the parameters is done off line by an
iterative scheme, $\mathbf{w}_{k+1} = \mathbf{w}_k - \eta \Delta\mathbf{w}_k$, where $\Delta\mathbf{w}_k$ indicates the direction of
change and $\eta$ is the (adaptive) size of the step. When training recurrent
neural networks the most commonly used scheme is gradient descent, where
the direction $\Delta\mathbf{w}_k$ is equal to the gradient $\mathbf{g}$, $g_i = \partial E(\mathbf{w}_k)/\partial w_i$.
Unfortunately this method suffers from extremely slow convergence, and the quality
of the resulting solutions is often not satisfactory.
Experiments have shown that much more efficient training can be obtained
by using second-order methods [6]. Here we focus on the damped
Gauss-Newton method [3], in which the search direction $\Delta\mathbf{w}_k$ is determined by

$$\Delta\mathbf{w}_k = \mathbf{H}^{-1} \mathbf{g} \qquad (4)$$

where $\mathbf{H}$ is the positive semidefinite approximation to the Hessian,

$$\frac{\partial^2 E}{\partial w_i \partial w_j} = \sum_{t=1}^{T} \left[ \frac{\partial y(t)}{\partial w_i} \frac{\partial y(t)}{\partial w_j} - e(t) \frac{\partial^2 y(t)}{\partial w_i \partial w_j} \right] \approx \sum_{t=1}^{T} \frac{\partial y(t)}{\partial w_i} \frac{\partial y(t)}{\partial w_j} = H_{ij} \qquad (5)$$

In each iteration $k$ the step size $\eta$ is determined by line search, which makes
the method globally convergent [3]; here we recommend a simple approach
where $\eta$ is halved until a decrease in the cost is obtained [3]. The iterations are
continued until convergence, determined by a sufficiently small length of the
gradient, $\|\mathbf{g}\|_2 < \epsilon$. The Gauss-Newton method involves finding the solution
to a linear system of equations $\mathbf{H} \Delta\mathbf{w}_k = \mathbf{g}$ in each iteration, but the increased
computational burden is justified by a dramatic increase in convergence speed and
thus a reduction of overall training time, even for large networks as we shall see.
However, the success of the damped Gauss-Newton method relies heavily on
the conditioning of the training problem, as is the case for gradient descent.


ILL-CONDITIONING 

When training using either gradient descent or the Gauss-Newton method,
a measure of great importance for the convergence is the condition number
of the Hessian $\mathbf{H}$. For a symmetric positive definite matrix $\mathbf{H}$, the condition
number is defined as $\kappa(\mathbf{H}) = \lambda_{\max}/\lambda_{\min}$, the ratio between the largest and
smallest eigenvalue of $\mathbf{H}$. If the condition number is large, the Hessian is
ill-conditioned. The convergence rate will suffer and the solution to the
linear system of equations (4) in the Gauss-Newton method becomes
unreliable. As a rule of thumb the solution may not be trustworthy if $\kappa(\mathbf{H}) > 1/\sqrt{\epsilon}$,
where $\epsilon$ denotes the machine precision [3]. For the IEEE 64-bit floating point
representation this is equivalent to $\kappa(\mathbf{H}) > 6.7 \cdot 10^{7}$. This may seem a large
number, but this order of magnitude is not uncommon in the framework of
either feedforward networks [7] or recurrent networks as we shall see.
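
The threshold quoted for 64-bit floating point can be reproduced directly from the machine precision, taken here as the spacing between 1.0 and the next representable number. The variable names are illustrative only.

```python
import numpy as np

eps_machine = np.finfo(np.float64).eps      # 2**-52, about 2.2e-16
kappa_limit = 1.0 / np.sqrt(eps_machine)    # 2**26, about 6.7e7
print(f"{kappa_limit:.3g}")
```

Condition numbers above this limit mean that roughly half of the available significant digits may be lost when solving the linear system.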

In [2] it was shown that an eigenvalue of the order of the number of
input variables can be avoided if the mean is subtracted from each of the
input variables $x_k(t)$ and if a symmetric activation function is used. However,
these simple countermeasures are not adequate for avoiding ill-conditioning
in recurrent networks, as the analysis in the following will show.

The Hessian (5) can also be written as

$$\mathbf{H} = \mathbf{J}^T \mathbf{J} \qquad (6)$$

where $\mathbf{J}$ is the Jacobian matrix, whose columns hold, for each weight, the partial
derivatives of the network output at each timestep in the training series. If $\mathbf{J}$ is
rank-deficient some of the columns are linearly dependent, which is indicated by
singular values with the value zero in an SVD analysis. This again leads to a
singular Hessian and thus an infinite condition number. In practice it is rare to
find columns in $\mathbf{J}$ that are exactly dependent and thus singular values that are
exactly zero [7]. However, it is often the case that columns are nearly linearly
dependent, which leads to very small singular values of $\mathbf{J}$ and thus large
condition numbers for the Hessian $\mathbf{H}$. In the following sections we describe
situations leading to ill-conditioning of $\mathbf{J}$ for recurrent networks, arising from
both exact and approximate linear dependencies between columns in $\mathbf{J}$.


Exact  dependency 

For the type of recurrent networks defined by (1) and (2) there is a built-in rank
deficiency in the Jacobian, since it is easy to show that some of the columns
in $\mathbf{J}$ will always be linear combinations of each other. This is illustrated by
an example for a small network, but the result applies to networks with an
arbitrary number of hidden units. The network considered here involves only
one external input and one hidden unit, and the output is thus defined as

$$y(t) = w_{o1} s_1(t) + w_{ob} \qquad (7)$$

$$s_1(t) = f\big( w_{11} s_1(t-1) + w_{1o} y(t-1) + w_{1x} x(t) + w_{1b} \big) \qquad (8)$$

$$\phantom{s_1(t)} = f\big( (w_{11} + w_{1o} w_{o1}) s_1(t-1) + w_{1x} x(t) + (w_{1b} + w_{1o} w_{ob}) \big) \qquad (9)$$

where (9) is obtained by insertion of (7) in (8). We see that the network
output will remain unchanged as long as the total weighting $k_1$ of $s_1(t-1)$,
$k_1 = w_{11} + w_{1o} w_{o1}$, and the total bias $k_2$ on the hidden unit, $k_2 = w_{1b} +
w_{1o} w_{ob}$, remain constant. $w_{o1}$ and $w_{ob}$ cannot be changed without directly
affecting the network output (7) and are therefore kept fixed, which we denote
by $*$. However, changes in $w_{11}$, $w_{1o}$ and $w_{1b}$ that satisfy both expressions

$$\begin{aligned}
w_{11} + w_{o1}^{*} \cdot w_{1o} + 0 &= k_1 \\
0 + w_{ob}^{*} \cdot w_{1o} + w_{1b} &= k_2
\end{aligned} \qquad (10)$$

will leave the network output unchanged. The expressions (10) form
hyperplanes in the parameter space spanned by $w_{11}$, $w_{1o}$ and $w_{1b}$, and their line of
intersection is computed as $(w_{11}, w_{1o}, w_{1b}) = (k_1, 0, k_2) + t\,(-w_{o1}^{*}, 1, -w_{ob}^{*})$,
parametrized by $t$. The line defines a direction in parameter space in which
the network output is constant. The constant network output means that
derivatives are zero in this direction. Thus, columns in the Jacobian
corresponding to $(w_{11}, w_{1o}, w_{1b})$ are linearly dependent.
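
The invariance can be checked with a small simulation of the one-hidden-unit network. In this sketch the feedback weight $w_{1o}$ is increased by an amount `tau` while $w_{11}$ and $w_{1b}$ are decreased by `tau` times $w_{o1}$ and $w_{ob}$ respectively, which keeps $k_1$ and $k_2$ fixed. The start-up output $y(0)$ is taken from (7) with $s_1(0) = 0$, an assumption under which the invariance is exact from the first time step; the weight values are arbitrary.

```python
import numpy as np

def small_rnn(x, w11, w1o, w1x, w1b, wo1, wob):
    """One hidden unit and one input; y(0) follows from the output equation with s1(0) = 0."""
    s = 0.0
    y_prev = wo1 * s + wob      # assumed start-up output, consistent with the output equation
    y = np.empty(len(x))
    for t, xt in enumerate(x):
        s = np.tanh(w11 * s + w1o * y_prev + w1x * xt + w1b)   # hidden unit update
        y[t] = wo1 * s + wob                                   # linear output
        y_prev = y[t]
    return y

rng = np.random.default_rng(5)
x = rng.standard_normal(50)
w11, w1o, w1x, w1b, wo1, wob = 0.3, -0.4, 0.8, 0.1, 1.2, -0.2

y_ref = small_rnn(x, w11, w1o, w1x, w1b, wo1, wob)

tau = 0.7   # displacement along the direction that keeps k1 and k2 constant
y_moved = small_rnn(x, w11 - tau * wo1, w1o + tau, w1x, w1b - tau * wob, wo1, wob)

max_diff = np.max(np.abs(y_ref - y_moved))
```

With a different start-up convention, such as setting the previous output to zero, the two outputs would differ during an initial transient, which is the effect discussed in the text that follows.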

When investigating Jacobians for the dependency problem outlined above
it is however uncommon to encounter singular values exactly equal to zero,
although according to the derivations this clearly ought to be the case. The
reason for this is the initialization of previous state values when starting up the
network. If the recurrent network starts iteration at time $t = 1$ it is common
practice [10] to set the previous states of the hidden units as well as their
derivatives to zero, $s_i(0) = 0$, $\partial s_i(0)/\partial w = 0$. This startup procedure
clearly marks an initial discontinuity in the recursive equations (7) and (8)
governing the feedback network. Thus initially the partial derivatives wrt.
the involved weights in the Jacobian will generally not be linearly dependent.
But after a few iterations, indicating a transient, the dependency arises with
increasing accuracy. The linear dependency is eliminated if we omit the
feedback weights $w_{io}$ from the output to the hidden units $i$, as the degeneracy
can then no longer occur. This elimination has no influence on the modeling
capabilities of the network since the remaining weights can be adjusted so
that the network output remains unaffected.


Approximate  dependency 

Even though removal of the feedback weights $w_{io}$ leading from the linear
output to the hidden units removes the problem of almost exact rank-deficiency
in the Jacobian for recurrent networks, it does not eliminate ill-conditioning,
as experiments show. In [7] the problem of ill-conditioning was analyzed for
feedforward networks by careful examination of the components entering the
partial derivatives $\partial y(t)/\partial w_i$ of the network output, and it was found that
ill-conditioning in the Jacobian can arise for at least these three reasons
(assuming that the external inputs are not proportional):

1.  The  output  from  a  hidden  unit  is  saturated  and  constant  (=  ±1). 

2.  The  outputs  from  two  hidden  units  are  approximately  proportional. 

3.  The  derivatives  of  two  hidden  unit  outputs  wrt.  their  activations  are 
approximately  proportional. 

Theoretical and empirical examinations of the components entering the partial
derivatives for recurrent networks reveal that ill-conditioning may arise
here for the same reasons; this analysis is, however, not included here.

Situation  2  where  the  outputs  of  two  hidden  units  are  proportional  and 
thus  highly  correlated  often  occurs  in  practice;  e.g.,  in  [9]  high  correlation 
between  hidden  unit  outputs  was  found  and  studied  for  feedforward  networks. 

The effects of situation 2 are similar to the effects of the exact
dependencies described above, as we can determine directions in parameter
space in which the cost function is approximately constant. For recurrent
networks this situation is much more severe than for feedforward networks,
since the degeneracy will affect not only weights leading to the output but
also many weights connecting the hidden units, as the experiments will show.

The scenarios listed above lead to nearly linear dependencies between
the columns of J and thus to small eigenvalues in H. However, the condition
number of a matrix is determined by the ratio between the largest and smallest
eigenvalues; thus problems arise not only from small singular values but
also from large values. As mentioned, the situations described above will lead
to directions in parameter space where the cost is approximately constant;
thus, when training using the Gauss-Newton method, the search direction will
be dominated by these directions, leading to an unrestrained growth in the
magnitude of the affected weights. This in turn leads to a significant growth
in the magnitude of several of the columns in the Jacobian, since many
derivatives are dominated by terms of the form [10]


$$\frac{\partial s_k(t)}{\partial w_{pq}} \propto \sum_{j=1}^{N_h} w_{kj} \frac{\partial s_j(t-1)}{\partial w_{pq}} \tag{11}$$


which becomes large if the weights $w_{kj}$ become large. The large elements
in the Jacobian lead to an overall upward scaling of the elements in J and
thus to an upward shift of the eigenvalues.


REGULARIZATION 

A traditional method for handling ill-conditioning is to regularize the cost
function [3, 4]. A simple yet highly effective regularization can be obtained
by augmenting the cost function by a simple quadratic weight decay [4],

$$C(w) = E(w) + \frac{\alpha}{2} w^T w \tag{12}$$

Simple weight decay is often primarily considered as a means for avoiding
overfitting, as it puts constraints on the parameters and thus reduces the
degrees of freedom. Weight decay should, however, also be considered for
its regularizing effects. The immediate effect is that $\alpha$ gets added to
the diagonal of the Hessian, which puts a lower bound on the smallest
eigenvalues, since it is easy to show that
$\lambda(H + \alpha I) = \lambda(H) + \alpha$. Another effect is the
limit imposed on the growth of the weights, which prevents near-singular
directions in parameter space from dominating the search directions obtained
by the Gauss-Newton method, thus greatly improving the efficiency of the
optimization. The constraints put on the weights by the regularization have
a smoothing effect on the cost function, which was clearly illustrated in [6].
There it was also demonstrated that the significance of the second order term
ignored in (5) diminishes when using simple weight decay as regularization.
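
The eigenvalue shift can be checked numerically on a small case. The following
is only a sketch under stated assumptions: the two Jacobian columns and the
value of the weight decay are invented for illustration, and the Hessian is
taken to be the Gauss-Newton approximation H = J^T J restricted to those two
columns.

```python
import math

def sym2x2_eigenvalues(a, b, d):
    """Eigenvalues (ascending) of the symmetric matrix [[a, b], [b, d]]."""
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    return mean - radius, mean + radius

def condition_number(a, b, d):
    """Ratio between the largest and the smallest eigenvalue."""
    smallest, largest = sym2x2_eigenvalues(a, b, d)
    return largest / smallest

# Two almost parallel Jacobian columns (hypothetical values), giving a
# nearly singular Gauss-Newton Hessian H = J^T J.
col1 = [1.0, 2.0, 3.0]
col2 = [1.0, 2.0, 3.001]
h11 = sum(u * u for u in col1)
h12 = sum(u * v for u, v in zip(col1, col2))
h22 = sum(v * v for v in col2)

alpha = 1e-3  # weight decay, assumed value

eig_h = sym2x2_eigenvalues(h11, h12, h22)
eig_reg = sym2x2_eigenvalues(h11 + alpha, h12, h22 + alpha)  # H + alpha * I
cond_h = condition_number(h11, h12, h22)
cond_reg = condition_number(h11 + alpha, h12, h22 + alpha)

print("eigenvalues of H:          ", eig_h)
print("eigenvalues of H + alpha I:", eig_reg)
print("condition numbers:         ", cond_h, cond_reg)
```

Both eigenvalues move up by alpha, so the smallest one can no longer fall below
alpha, while the largest one is almost unchanged; the condition number drops
accordingly.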

EXPERIMENTS 

In the first experiment we illustrate how ill-conditioning results from some of
the situations described herein and how regularization improves training. For
this experiment we used a simple recurrent network to predict the laser data
from the Santa Fe time series prediction competition [8]. The data were scaled
so that the first 1000 points used for training had zero mean and unit variance,
and the following 100 values were used as a test set. The network used had
one external input and three hidden units; there was no feedback from the
linear output unit to the hidden units, as found appropriate above. Training
was performed initially using five iterations of gradient descent followed by
the damped Gauss-Newton method. The left panel of Figure 1 shows the
evolution of the mean squared errors normalized by the variance of the sets
(NMSE, [8]) when training without regularization. It seems that training is
converging to a solution, but this is not the case, as the evolution of the
weights in the right panel of Figure 1 shows.

Figure 1: Training without regularization. Left panel: Evolution of training and
test errors. Right panel: Evolution of the weight values.

What happens is that the outputs of two hidden units become almost
proportional; this is revealed by the cosine of the angle $\theta$ between
vectors containing their outputs on the training set, which at iteration 100
is $\cos\theta = 0.9998$. This corresponds to situation 2 listed above. The
weights that grow in magnitude are the pairs of weights leading from these two
units to every unit in the network, including the output. Note that the error
and thus the network output is unaffected, since the effects of the changes in
the growing weights cancel out due to the dependency between the hidden units.
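
The proportionality check described above amounts to a cosine between two
output vectors. A minimal sketch, assuming the hidden unit outputs have been
collected as plain lists over the training set; the three sequences below are
invented for illustration and are not the laser data.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical hidden unit outputs over ten time steps.
s1 = [math.tanh(0.1 * t) for t in range(1, 11)]
# s2 is almost proportional to s1 (factor 2 plus a small perturbation).
s2 = [2.0 * x + 0.001 * (-1) ** i for i, x in enumerate(s1)]
# s3 is unrelated to s1.
s3 = [math.cos(t) for t in range(1, 11)]

print("cos(s1, s2) =", cosine(s1, s2))
print("cos(s1, s3) =", cosine(s1, s3))
```

A cosine close to one in magnitude flags a pair of units whose outgoing
weights can grow without changing the network output.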

The condition number during training is shown in the left panel of Figure 2
and is seen to grow enormously. The rapid increase occurs shortly after
the initiation of the second-order method, which quickly 'discovers' the
dependency between the hidden unit outputs. The near-singular Hessian H leads
to very large weight changes in some directions when solving (4). The large
steps are, however, handled by the line search, which returns very small step
sizes, as indicated by the smooth increase in the weight magnitudes. The right
panel of Figure 2 shows the eigenvalues of the Hessian after iterations 7, 20
and 100. At each of these iterations the condition number is seen to result
from both very small and very large eigenvalues, and we note that as
training progresses the eigenvalues extend both upward and downward.

Figure 2: Training without regularization. Left panel: Evolution of the condition
number for H. Right panel: Eigenvalues after iterations 7, 20 and 100.

Figure 3: Training with regularization, α = 10“^. Left panel: Evolution of training
and test errors. Right panel: Evolution of the weight values.

The training was then repeated using exactly the same initial weights and
the same training approach, but now with a regularization term added to the
cost function, using α = 10“^. The left panel of Figure 3 shows the
resulting evolution of the errors. The positive effect of the regularization is
evident, as the final errors are several orders of magnitude below the levels
shown in Figure 1 obtained without regularization. Furthermore, the stopping
criterion $\|g\|_2$ < 10“^ was satisfied; in the previous experiment using no
regularization the gradient norm grew in proportion to the condition number.

In the right panel of Figure 3 we see that the regularization term limits the
growth of the weights compared to Figure 1. Some weights, however, still grow
large, as does the condition number shown in the left panel of Figure 4. Even
though the condition number grows to 10® the damped Gauss-Newton method still
manages to find a minimum. Experience shows that for this method successful
training to a (local) minimum can be obtained for condition numbers up to
about 10® in magnitude. This may depend on the decomposition algorithm
used when solving (4); here we use the fast and stable Cholesky factorization
[3].

Figure 4: Training with regularization, α = 10“^. Left panel: Evolution of the
condition number for H. Right panel: Eigenvalues after iterations 7, 20 and 100.

From the right panel of Figure 4 we learn that the reduction in condition
number is obtained only from an increase in the smallest eigenvalues resulting
from the regularization. The largest eigenvalues are of the same order of
magnitude as when training without weight decay, see Figure 2. This is due
to the still fairly large weight magnitudes. If the regularization parameter
$\alpha$ is further increased, the larger eigenvalues will also be affected;
but so will the modeling capabilities of the network, leading to increased
errors.
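
For readers who want to see the linear algebra behind the Cholesky step, the
sketch below solves a small regularized system. It rests on assumptions that
are not taken from the text: the system referred to as (4) is assumed to have
the usual Gauss-Newton form (H + alpha I) delta = -g, and the matrix, gradient
and alpha are invented values.

```python
import math

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low

def solve_spd(a, b):
    """Solve a x = b by forward and back substitution on the Cholesky factor."""
    low = cholesky(a)
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x

# Hypothetical 2 x 2 Hessian approximation, gradient and weight decay.
H = [[4.0, 2.0], [2.0, 3.0]]
g = [1.0, -2.0]
alpha = 0.5
H_reg = [[H[i][j] + (alpha if i == j else 0.0) for j in range(2)] for i in range(2)]

delta = solve_spd(H_reg, [-gi for gi in g])
residual = [sum(H_reg[i][j] * delta[j] for j in range(2)) + g[i] for i in range(2)]
print("search direction:", delta)
```

The factorization fails with a `ValueError` when a pivot is not positive, which
is one way near-singularity of an unregularized Hessian shows up in practice.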

In  the  final  experiment  we  compare  the  performance  of  damped  Gauss- 
Newton  with  a  gradient  descent  algorithm  also  using  the  step-size  halving 
line  search.  The  problem  is  still  prediction  of  the  laser  series  but  using  larger 
networks  with  a  single  input  and  nine  hidden  units,  109  weights  in  total  (no 
feedback  from  the  output  to  the  hidden  units).  Thus,  each  iteration  using 
damped  Gauss-Newton  involved  solution  of  a  109  by  109  linear  system  of 
equations. Six initial networks were generated by initializing their weights
with values drawn from a uniform distribution over the interval [-0.3, 0.3].
The training algorithms were then compared when starting from the same six
initial networks, both using regularization α = 0.02. The resulting evolution
of errors is shown in Figure 5; in the left panel we see the resulting errors
using the damped Gauss-Newton method, in the right panel using gradient
descent. For both methods the stopping criterion was set to ||g||2 < 10"'^ or
a maximum of 10000 iterations.
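
The gradient descent baseline with step-size halving can be sketched as
follows. This is an illustration under assumptions rather than the procedure
used in the experiments: the cost is a toy quadratic with weight decay
α = 0.02, the gradient tolerance is an assumed value, and a step is accepted
only under a sufficient-decrease test with an assumed constant of 0.25.

```python
import math

ALPHA = 0.02  # weight decay, as in the final experiment

def cost(w):
    """Toy quadratic error plus the weight decay term (alpha / 2) * w^T w."""
    error = 0.5 * (w[0] - 1.0) ** 2 + 5.0 * (w[1] + 2.0) ** 2
    return error + 0.5 * ALPHA * sum(wi * wi for wi in w)

def grad(w):
    """Gradient of the toy cost."""
    return [(w[0] - 1.0) + ALPHA * w[0], 10.0 * (w[1] + 2.0) + ALPHA * w[1]]

def halving_line_search(w, direction, slope, step=1.0, c=0.25, max_halvings=60):
    """Halve the step until the cost drops by at least c * step * |slope|."""
    base = cost(w)
    for _ in range(max_halvings):
        candidate = [wi + step * di for wi, di in zip(w, direction)]
        if cost(candidate) <= base + c * step * slope:
            return candidate, step
        step *= 0.5
    return w, 0.0

def gradient_descent(w, tol=1e-6, max_iter=10000):
    """Iterate until the gradient 2-norm falls below tol or max_iter is hit."""
    for iteration in range(max_iter):
        g = grad(w)
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            return w, iteration
        slope = -sum(gi * gi for gi in g)  # directional derivative along -g
        w, step = halving_line_search(w, [-gi for gi in g], slope)
        if step == 0.0:
            return w, iteration
    return w, max_iter

w_final, iterations = gradient_descent([0.0, 0.0])
print("weights:", w_final, "iterations:", iterations)
```

Even on this two-parameter problem the ratio between the two curvatures (about
ten) forces small steps, which hints at why gradient descent needs many
iterations when the condition number is large.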


Figure  5:  Evolution  of  errors  using  different  optimization  methods.  Left  panel: 
Damped  Gauss-Newton  method.  Right  panel:  Gradient  descent  with  line  search. 


For the damped Gauss-Newton method the stopping criterion was met in
all six runs. The average training error (Normalized Mean Squared Error) was
7.7 · 10“'^, and the average test error was 4.9 · 10~^. The average time for a
complete training run was 200 seconds. For gradient descent the stopping
criterion was never met; each run terminated because the maximum number of
iterations was reached. The average training error obtained after the
maximum allowed 10000 iterations was 4.0 · 10”^, and the average test error
was 7.8 · 10“^. The average time used for obtaining these error levels was 8140
seconds. Note that the levels of both training and test errors obtained using
gradient descent are much higher than the levels obtained using the damped
Gauss-Newton method, even though gradient descent used about 50 times
more iterations and about 40 times more computer time. Thus, even
though an iteration of the damped Gauss-Newton method is computationally
more costly than an iteration of gradient descent, the additional cost is
justified by the vastly increased convergence rate. Similar results have
been observed for networks with up to 300 parameters.

CONCLUSION 

In this paper we have focused on sources of ill-conditioning and thus the need
for regularization when training recurrent networks, especially when using
second-order methods. Once this need is recognized, a dramatic improvement in
convergence rate and quality of solution is obtained, even for large problems.

ACKNOWLEDGMENTS 

The  author  would  like  to  thank  Lars  Kai  Hansen  and  Jan  Larsen  for  support. 
This research is supported by the Danish Natural and Technical Research
Councils through the Computational Neural Network Center (connect).

REFERENCES 

[1] Y. Bengio, P. Simard and P. Frasconi, “Learning long-term dependencies with
gradient descent is difficult,” IEEE Transactions on Neural Networks,
vol. 5, no. 2, pp. 157-166, 1994.

[2] Y. Le Cun, I. Kanter and S. A. Solla, “Eigenvalues of covariance matrices:
Application to neural-network learning,” Physical Review Letters, vol. 66,
no. 18, pp. 2396-2399, 1991.

[3] J. E. Dennis and R. B. Schnabel, Numerical Methods for Unconstrained
Optimization and Nonlinear Equations, Englewood Cliffs, NJ:
Prentice-Hall, 1983.

[4] S. Haykin, Neural Networks: A Comprehensive Foundation, New York,
NY: Macmillan, 1994.

[5] S. Hochreiter and J. Schmidhuber, “Long short term memory,” Tech. Rep.
FKI-207-95, Fakultät für Informatik, Technische Universität München, München,
1995.

[6] M. W. Pedersen and L. K. Hansen, “Recurrent networks: Second order
properties and pruning,” in G. Tesauro, D. Touretzky and T. Leen, eds., Advances
in Neural Information Processing Systems, The MIT Press, 1995, vol. 7,
pp. 673-680.

[7] S. Saarinen, R. Bramley and G. Cybenko, “Ill-conditioning in neural network
training problems,” SIAM Journal on Scientific Computing, vol. 14,
pp. 693-714, 1993.

[8] A. S. Weigend and N. A. Gershenfeld, eds., Time Series Prediction:
Forecasting the Future and Understanding the Past, Reading, MA:
Addison-Wesley, 1993.

[9] A. S. Weigend and D. E. Rumelhart, “The effective dimension of the space
of hidden units,” in E. Keramidas, ed., Proceedings of INTERFACE’91:
Computing Science and Statistics, Springer Verlag, 1992.

[10] R. J. Williams and D. Zipser, “A learning algorithm for continually running
fully recurrent neural networks,” Neural Computation, vol. 1, pp. 270-280,
1989.




COMBINING  DISCRIMINANT-BASED  CLASSIFIERS 
USING  THE  MINIMUM 
CLASSIFICATION  ERROR  DISCRIMINANT 


Naonori  UEDA  Ryohei  NAKANO 


NTT  Communication  Science  Laboratories 
Hikaridai  Seika-cho  Soraku-gun  Kyoto  619-02  Japan 
e-mail:  ueda@cslab.kecl.ntt.jp 


Abstract. Focusing on classification problems, this paper presents a
new method for linearly combining discriminant-based classifiers to
improve the classification performance, in the sense of the minimum
classification error. In our approach, the problem of estimating the
linear weights in the combination is reformulated as the problem of
designing a linear discriminant function using the minimum classification
error discriminant. In this formulation, because the classification
decision rule is incorporated into the cost function, better combination
weights suitable for the classification objective can be obtained.
Experimental results using neural network classifiers support the
effectiveness of the proposed method.


INTRODUCTION 

Compared with a single estimator, combining estimators has been shown to
improve the generalization error [1]-[8]. Approaches to combining estimators
have recently attracted major interest in the neural network community
because of their simplicity and theoretical implications. The output of the
combined estimator for some input is defined as a linear combination of the
outputs of multiple estimators, where it is assumed that each estimator is
separately constructed by using the same training data.

In these approaches, how to determine the linear weights is an important
problem in practice. A naive way is to employ simple averaging (i.e., equal
weights). Recently, a few methods have been presented [4] [8] for combining
regressors. However, as we will describe later, these methods are not suitable
for our problem of combining classifiers.

In this paper, we present a new way of combining discriminant-based
classifiers to improve the classification performance. In our approach, the
problem of estimating the linear weights in the combination is reformulated as
the problem of designing linear discriminant functions using the minimum
classification error discriminant. Since the decision rule is directly
incorporated into the cost function in this formulation, we can obtain linear
weights with which a better combined classifier can be constructed for
achieving the minimum misclassification probability.

This minimum classification error discriminant approach has successfully
been utilized to train individual discriminant functions [9] [10]. We show
how this criterion can be used to estimate the linear weights in
combining discriminant-based classifiers.


PROBLEM  FORMULATION 

Let $x$ be an observation vector, and let it be our purpose to assign $x$ to
one of $K$ classes. A decision rule in terms of discriminant functions is
written as follows:

$$C(x) = k \quad \text{if} \quad f^{(k)}(x) = \max_j f^{(j)}(x). \tag{1}$$

Here $C(\cdot)$ denotes a classification operation and $f^{(k)}(x)$ is the
discriminant function for class $k$. Thus, a classifier consists of $K$
discriminant functions. The simplest instance is the linear discriminant
function specified by a weight vector $w$ as $f(x) = w^T x$, where $T$ denotes
transposition. Neural network classifiers might be the most complicated. In
the case of a neural network classifier, the $k$th output unit corresponds to
the discriminant function for class $k$.

Suppose we have $M$ available classifiers, which were trained using the same
training data $\mathcal{D} = \{(x_i, C(x_i)),\ i = 1, \ldots, N\}$. Let
$f_m^{(k)}(x; \mathcal{D})$ denote the output of the $m$th discriminant
function for class $k$ for some input $x$ after it has been trained on
$\mathcal{D}$. Then, in an analogous manner to the original definition
of the linear combination of multiple regressors [2], the combined discriminant
function for each class can be defined as the linear combination of all $KM$
discriminant functions:

$$f_{\mathrm{com}}^{(k)}(x) = \alpha^{(k)T} f(x; \mathcal{D}), \quad k = 1, \ldots, K. \tag{2}$$

Here,

$$f(x; \mathcal{D}) = \left(f_1^{(1)}(x; \mathcal{D}), \ldots, f_m^{(k)}(x; \mathcal{D}), \ldots, f_M^{(K)}(x; \mathcal{D})\right)^T \tag{3}$$

and therefore $\alpha^{(k)}$ is a $KM$-dimensional column vector. Another
definition such as
$f_{\mathrm{com}}^{(k)}(x) = \sum_{m=1}^{M} \alpha_m^{(k)} f_m^{(k)}(x; \mathcal{D})$,
where $k = 1, \ldots, K$, is possible, but our definition is more flexible and
general.

Let $y(x) = \left(f_{\mathrm{com}}^{(1)}(x), \ldots, f_{\mathrm{com}}^{(K)}(x)\right)^T$
be a classifier vector. Then, (2) can be written as

$$y(x) = W^T f(x; \mathcal{D}), \tag{4}$$

where $W = (\alpha^{(1)}, \ldots, \alpha^{(K)})$. Therefore, one can see that
(4) indicates a linear mapping from an $f \in \mathbb{R}^{KM}$-space to a
$y \in \mathbb{R}^{K}$-space.

Our goal here is to estimate $\alpha^{(k)}$, where $k = 1, \ldots, K$, to
design a combined classifier achieving the minimum classification error
probability.
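
The linear mapping (4) followed by the decision rule can be sketched in a few
lines. The sketch makes assumptions that are not fixed by the text: the entries
of f are ordered classifier by classifier, W is stored as a list of KM rows
with K entries each, and all numbers are invented.

```python
def combine(W, f):
    """Classifier vector y = W^T f; W has KM rows and K columns."""
    n_classes = len(W[0])
    return [sum(W[r][k] * f[r] for r in range(len(f))) for k in range(n_classes)]

def classify(y):
    """Decision rule: 1-based index of the largest combined discriminant."""
    return max(range(len(y)), key=lambda k: y[k]) + 1

# K = 2 classes, M = 2 classifiers, so f has KM = 4 entries, assumed to be
# ordered as (f_1^(1), f_1^(2), f_2^(1), f_2^(2)).
f = [0.7, 0.3, 0.4, 0.6]

# Columns of W are alpha^(1) and alpha^(2); here each class simply averages
# the two discriminants that belong to it.
W = [
    [0.5, 0.0],
    [0.0, 0.5],
    [0.5, 0.0],
    [0.0, 0.5],
]

y = combine(W, f)
label = classify(y)
print("y =", y, "-> class", label)
```

Because every column of W may weight all KM outputs, a class can also draw on
discriminants trained for other classes, which is the extra flexibility of
definition (2).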


PREVIOUS  WORK 
Variance-based  Weighting 

Theoretically, if the prediction values of individual estimators are
uncorrelated and unbiased, the combined prediction value becomes unbiased.
Moreover, the smallest generalization error at $x$ can be attained only if the
weighting function is selected as the inverse of the variance of the
prediction values at $x$ [2] [4] [7]. Tresp et al. [4] proposed a non-constant
weighting function defined as the inverse of the variance depending on $x$.
This method is promising in problems involving the usual combining of
regressors, denoted as
$g_{\mathrm{com}}(x) = \sum_{m=1}^{M} \alpha_m(x)\, g_m(x; \mathcal{D})$.
In our definition for combination (2), however, for all $k$ the same linear
weights:

$$\alpha^{(k)}(x) \propto \left(1/\mathrm{var}(f_1^{(1)}(x; \mathcal{D})), \ldots, 1/\mathrm{var}(f_M^{(K)}(x; \mathcal{D}))\right)^T$$

are assigned, and as a result
$f_{\mathrm{com}}^{(1)}(x) = f_{\mathrm{com}}^{(2)}(x) = \cdots = f_{\mathrm{com}}^{(K)}(x)$.
This is clearly inappropriate. This might be modified by considering the bias
of $f_m^{(k)}$, but generally it is much more difficult to estimate the bias.
Consequently, the variance-based weighting method is not directly applicable
to our task.


Stacked  Regression  (MSE-criterion) 

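Let $b(x)$ be a $K$-dimensional column vector whose $k$th component is one
and whose other components are zero if $C(x) = k$. Let $X$ be a random
variable having the distribution $p(x)$. In a regression setting, the mean
squared error (MSE) of the combined classifier $f_{\mathrm{com}}$ is written as

$$\mathrm{MSE}(f_{\mathrm{com}}) = E_X\left\{ \left\| b(X) - \left(\alpha^{(1)T} f(X; \mathcal{D}), \ldots, \alpha^{(K)T} f(X; \mathcal{D})\right)^T \right\|^2 \right\}. \tag{5}$$

Therefore, one way of estimating the weights with $\mathcal{D}$ is just to
take $\alpha^{(k)}$, where $k = 1, \ldots, K$, to minimize

$$\frac{1}{N} \sum_{i=1}^{N} \left\| b(x_i) - \left(\alpha^{(1)T} f(x_i; \mathcal{D}), \ldots, \alpha^{(K)T} f(x_i; \mathcal{D})\right)^T \right\|^2. \tag{6}$$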

However, as pointed out by Breiman [5], since $\mathcal{D}$ is utilized both in
the training of each $f_m^{(k)}$ and in the estimation of $\alpha$, the
obtained $f_{\mathrm{com}}^{(k)}$ will overfit the training data $\mathcal{D}$.
This will result in poor generalization. He also presents a method called
stacked regression to fix this problem, based on the idea of stacked
generalization by Wolpert [1].

Let $f_m^{(k)}(x_i; \mathcal{D}_{(-i)})$ denote the cross-validated output
evaluated at the left-out data point $x_i$ after $f_m^{(k)}$ has been trained
on $\mathcal{D}_{(-i)} = \mathcal{D} - \{(x_i, C(x_i))\}$. The
stacked regression estimates of the weights can be obtained by minimizing

$$\frac{1}{N} \sum_{i=1}^{N} \left\| b(x_i) - \left(\alpha^{(1)T} f(x_i; \mathcal{D}_{(-i)}), \ldots, \alpha^{(K)T} f(x_i; \mathcal{D}_{(-i)})\right)^T \right\|^2. \tag{7}$$

The above MSE solution can be easily obtained as

$$W = \left( \sum_{i=1}^{N} f(x_i; \mathcal{D}_{(-i)})\, f(x_i; \mathcal{D}_{(-i)})^T \right)^{-1} \sum_{i=1}^{N} f(x_i; \mathcal{D}_{(-i)})\, b(x_i)^T, \tag{8}$$

where

$$f(x_i; \mathcal{D}_{(-i)}) = \left(f_1^{(1)}(x_i; \mathcal{D}_{(-i)}), \ldots, f_M^{(K)}(x_i; \mathcal{D}_{(-i)})\right)^T. \tag{9}$$

Looking at (2) carefully, one can see that $f_{\mathrm{com}}^{(k)}$ is a linear
discriminant function in an $f$-space. According to the theory of statistical
pattern recognition [11], i.e., canonical discriminant analysis, the MSE
criterion produces the same solution as Fisher’s linear discriminant
criterion. In other words, a combined classifier based on the MSE approach is
exactly equivalent to Fisher’s linear discriminant function. It is known that
Fisher’s criterion works well as a class separability measure under the
assumption that the distribution of each class is normal. In our problem,
however, this assumption does not always hold, and therefore it is clear that
the MSE criterion does not always lead to the optimal solution in the Bayes
sense.

ESTIMATING  WEIGHTS  USING  THE  MINIMUM 
CLASSIFICATION  ERROR  DISCRIMINANT 

Formulation 

As mentioned in the previous section, since $f_{\mathrm{com}}^{(k)}$ can be
seen as a linear discriminant function parameterized by $\alpha^{(k)}$ in an
$f$-space, the problem of estimating the linear weights is exactly equivalent
to the problem of designing a linear classifier in an $f$-space. Thus our
problem is reduced to designing $K$ linear discriminant functions
$f_{\mathrm{com}}^{(1)}, \ldots, f_{\mathrm{com}}^{(K)}$ from the
cross-validated data
$\mathcal{D}' = \{(f(x_i; \mathcal{D}_{(-i)}), C(x_i)),\ i = 1, \ldots, N\}$.
Note that once $\mathcal{D}'$ is constructed, we do not need $\mathcal{D}$ any
more to estimate $W$. To emphasize this, we write
$\mathcal{D}' = \{(f_i, C(f_i)),\ i = 1, \ldots, N\}$ by setting
$f_i = f(x_i; \mathcal{D}_{(-i)})$ and $C(f_i) = C(x_i)$.

According to conventional pattern classifier design, $W$ is found as a
minimizer of the following expected loss:

$$L(W) = \sum_{k=1}^{K} P_k \int l_k(f; W)\, p(f \mid C(f) = k)\, df, \tag{10}$$

where $P_k$ is the prior probability of class $k$ and $l_k(f; W)$ denotes the
loss caused by misclassification of observation $f$ when $C(f) = k$. Note that
the loss $l_k$ depends on $W = (\alpha^{(1)}, \ldots, \alpha^{(K)})$ because a
classification with a combined classifier is done by the decision rule
$C(x) = k$ if $f_{\mathrm{com}}^{(k)}(x) = \max_j f_{\mathrm{com}}^{(j)}(x)$.

Since we do not have any knowledge about the distribution of $f$, we cannot
directly minimize $L(W)$. In practice, we minimize the following empirical
average loss using $\mathcal{D}'$:

$$L'(W) = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} l_k(f_i; W)\, 1(C(f_i) = k), \tag{11}$$

where $1(\mathcal{U})$ is 1 if $\mathcal{U}$ is true; 0 otherwise.


Minimum  Classification  Error  Discriminant 

The most popular loss function is the zero-one loss, which is 1 for
misclassification and 0 for correct classification. This loss function is
discontinuous at the decision boundary. To derive a gradient algorithm, a
smoothed loss function [10] in the form

$$l_k(f; W) = l_k(d_k(f; W)) = \frac{1}{1 + \exp(-\xi\, d_k(f; W) + r)} \tag{12}$$

is suitable. Here, $d_k(f; W)$ is the misclassification measure, which
quantifies how likely $f$ is to be misclassified as another class. The
introduction of a misclassification measure to the loss function originated
with Amari [9] and Juang & Katagiri [10]. We employ Juang & Katagiri’s
definition:

$$d_k(f; W) = -\alpha^{(k)T} f + \left[ \frac{1}{K-1} \sum_{j \neq k} \left(\alpha^{(j)T} f\right)^{\eta} \right]^{1/\eta}. \tag{13}$$

Here $\eta$ is a positive constant. Note that when $\eta \to \infty$, (13)
becomes

$$d_k(f; W) = -\alpha^{(k)T} f + \max_{j \neq k} \alpha^{(j)T} f. \tag{14}$$

Clearly, $d_k < 0$ means correct classification and $d_k > 0$ indicates
misclassification. The above $l_k(d)$ is a smoothed version of the conventional
zero-one loss; indeed, $l_k(d)$ monotonically approaches 0 (1) as $d$ decreases
(increases).

With these definitions, $L'(W)$ is now well-defined. Consequently, for the
initial estimate $W(0)$, the weight matrix $W$ can be iteratively estimated by
using the following probabilistic descent algorithm:

$$W(t+1) = W(t) - \epsilon(t)\, U \sum_{k=1}^{K} 1(C(f) = k) \left. \frac{\partial l_k(f; W)}{\partial W} \right|_{W = W(t)}. \tag{15}$$

Here,

$$\frac{\partial l_k}{\partial W} = \left( \frac{\partial l_k}{\partial \alpha^{(1)}}, \ldots, \frac{\partial l_k}{\partial \alpha^{(K)}} \right) \in \mathbb{R}^{KM \times K}, \tag{16}$$

$U$ is a small positive definite matrix, and $W(t)$ is the estimate at the
$t$th step. From the theory of stochastic approximation [12], it is guaranteed
that $W(t)$ will converge to at least a local optimum solution if
$\epsilon(t)$ satisfies the following conditions:

$$\lim_{t \to \infty} \epsilon(t) = 0, \quad \sum_{t=1}^{\infty} \epsilon(t) = \infty, \quad \text{and} \quad \sum_{t=1}^{\infty} \epsilon(t)^2 < \infty.$$

One can see that the decision rule is incorporated into the overall cost
function $L'(W)$. Therefore, we can obtain linear weights with which a better
combined classifier can be constructed for achieving the minimum
misclassification probability.
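
The misclassification measure and the smoothed loss can be sketched as follows.
Several points are assumptions made for this illustration only: the smoothed
loss is taken to be a sigmoid with slope `xi` and offset `r`, the rival term is
the power mean used by Juang & Katagiri, W is stored as a list of K weight
vectors, and the combined discriminants are assumed positive so that
non-integer powers are defined.

```python
import math

def discriminants(W, f):
    """Combined discriminants alpha^(j)T f; W is a list of K weight vectors."""
    return [sum(a * fi for a, fi in zip(alpha, f)) for alpha in W]

def measure(W, f, k, eta):
    """Misclassification measure for true class index k (0-based)."""
    y = discriminants(W, f)
    rivals = [y[j] for j in range(len(y)) if j != k]
    mean_power = sum(r ** eta for r in rivals) / len(rivals)
    return -y[k] + mean_power ** (1.0 / eta)

def measure_limit(W, f, k):
    """Limit of the measure for eta -> infinity: the best rival minus own score."""
    y = discriminants(W, f)
    rivals = [y[j] for j in range(len(y)) if j != k]
    return -y[k] + max(rivals)

def smoothed_loss(d, xi=10.0, r=0.0):
    """Assumed sigmoid form of the smoothed zero-one loss."""
    return 1.0 / (1.0 + math.exp(-xi * d + r))

# Hypothetical case with K = 3 classes and one classifier, so that W is the
# identity and the combined discriminants equal the classifier outputs.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
f = [0.7, 0.2, 0.1]

d_eta = measure(W, f, 0, 50.0)     # true class is the first one
d_lim = measure_limit(W, f, 0)
d_wrong = measure_limit(W, f, 1)   # pretend the true class is the second one
print(d_eta, d_lim, smoothed_loss(d_eta), smoothed_loss(d_wrong))
```

For a correctly classified sample the measure is negative and the loss is below
0.5; when the true class is not the winner the measure is positive and the loss
exceeds 0.5, and the finite-eta measure approaches its limit from below.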


EXPERIMENTS 
Two  Class  Problem 

We empirically compared our approach with the stacked regression approach
using a two-class problem for simplicity. The training data set
$\mathcal{D} = \{(x_i, C(x_i)),\ i = 1, \ldots, 100\}$ was artificially
generated from the following two-dimensional normal distributions:

Classl:^((-0.5  0.5)^,(Q^  ))+^  (  0^5  ^f))’ 

Class  2:  ((0.7  -  0.7)^,  (  )) • 

In the two-class problem, one discriminant function for each class is
sufficient, because by assuming $0 \le f_{\mathrm{com}}(x) \le 1$, the decision
rule can be written as $C(x) = 1$ if $f_{\mathrm{com}}(x) > 0.5$; $C(x) = 2$
otherwise. Thus, (2) is reduced to
$f_{\mathrm{com}}(x) = \alpha^T f(x; \mathcal{D})$. Therefore, the simple
averaging method ($\alpha = (1/3, 1/3, 1/3)^T$) was also compared. In this
experiment, as individual discriminant functions, we employed feedforward
neural network classifiers consisting of two input units, $H$ hidden units, and
one output unit. We set $H = 3, 9, 15$ and therefore

$$W = \alpha = (\alpha_1, \alpha_2, \alpha_3)^T \quad \text{and} \quad f(x; \mathcal{D}) = \left(f_1(x; \mathcal{D}), f_2(x; \mathcal{D}), f_3(x; \mathcal{D})\right)^T.$$

Moreover, in this case, (13) was modified as

$$d_1(f; W) = 0.5 - \alpha^T f \quad \text{and} \quad d_2(f; W) = \alpha^T f - 0.5.$$

We set $\xi = 10$ and $r = 0$.
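
For the two-class case the descent on the smoothed loss is short enough to
sketch in full. The following is an illustration under assumptions, not the
experiment itself: the smoothed loss is taken to be a sigmoid with slope 10 and
offset 0, the matrix U is the identity, the step size schedule
`eps0 / (1 + 0.01 t)` is one choice that meets the stated conditions, and the
four classifier output vectors are invented.

```python
import math

XI = 10.0  # slope of the assumed sigmoid loss (offset r = 0)

def smoothed_loss(d):
    """Assumed sigmoid form of the smoothed zero-one loss."""
    return 1.0 / (1.0 + math.exp(-XI * d))

def f_com(alpha, f):
    """Combined discriminant alpha^T f."""
    return sum(a * fi for a, fi in zip(alpha, f))

def measure(alpha, f, label):
    """d_1 = 0.5 - alpha^T f for class 1, d_2 = alpha^T f - 0.5 for class 2."""
    y = f_com(alpha, f)
    return 0.5 - y if label == 1 else y - 0.5

def average_loss(alpha, data):
    """Empirical average of the smoothed loss over the data set."""
    return sum(smoothed_loss(measure(alpha, f, c)) for f, c in data) / len(data)

def mce_descent(alpha, data, epochs=200, eps0=0.05):
    """Sample-by-sample descent on the smoothed loss with U = identity."""
    alpha = list(alpha)
    t = 0
    for _ in range(epochs):
        for f, c in data:
            loss = smoothed_loss(measure(alpha, f, c))
            slope = XI * loss * (1.0 - loss)   # derivative of the sigmoid
            sign = -1.0 if c == 1 else 1.0     # d(d_c)/d(alpha) = sign * f
            eps = eps0 / (1.0 + 0.01 * t)      # decaying step size
            alpha = [a - eps * slope * sign * fi for a, fi in zip(alpha, f)]
            t += 1
    return alpha

def classify(alpha, f):
    """Decision rule: class 1 if the combined discriminant exceeds 0.5."""
    return 1 if f_com(alpha, f) > 0.5 else 2

# Hypothetical cross-validated outputs of three classifiers with class labels.
data = [([0.9, 0.6, 0.4], 1), ([0.8, 0.3, 0.7], 1),
        ([0.2, 0.6, 0.5], 2), ([0.1, 0.4, 0.3], 2)]

alpha_avg = [1.0 / 3.0] * 3           # simple averaging as starting point
alpha_mce = mce_descent(alpha_avg, data)
print("weights:", alpha_mce)
print("loss before/after:", average_loss(alpha_avg, data), average_loss(alpha_mce, data))
```

In this toy data the first classifier separates the classes best, and the
descent shifts weight toward it, which mirrors the qualitative behaviour one
would hope for from a criterion tied to the decision rule.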

Figure 1 shows the classification results obtained by three individual neural
network classifiers. As one can easily guess, the simplest and the most complex
class boundaries were obtained when $H = 3$ and $H = 15$, respectively. The
training and test errors obtained by each classifier are shown in Table 1. A
test data set consisting of 3000 points per class was also artificially
generated independently of $\mathcal{D}$ from the above distributions.

Figure 2 shows classification results obtained by combined classifiers with
different weighting methods, i.e., simple averaging (Method 1), stacked
regression (Method 2), and the proposed method (Method 3). Table 2 compares
the classification performance of the three methods. The estimated
weights by Method 2 were $\alpha_{\mathrm{Method2}} = (0.18, 0.49, 0.33)^T$,
while $\alpha_{\mathrm{Method3}} = (0.55, 0.16, 0.29)^T$ by Method 3. For
Method 1, $\alpha_{\mathrm{Method1}} = (0.33, 0.33, 0.33)^T$.
Since the classifier with $H = 3$ is thought to produce the best classification
performance, intuitively $\alpha_{\mathrm{Method3}}$ (whose first component is
much larger than the others) can be thought of as the best.

Table 2 indicates that the stacked regression approach did not work well (it
was worse than the simple averaging method) and that the proposed method
produced the smallest classification error among the three methods. Moreover,
looking at Tables 1 and 2, one can see that Methods 1 and 2 certainly
improved the classification performance, but the test errors obtained from the
combined classifiers were larger than that of the best individual classifier
(i.e., $H = 3$). On the other hand, in the case of the proposed method, the
classification performance of the combined classifier was better than those
of all individual classifiers, at least in this experiment. This can be seen in
Figure 2, where the class boundaries obtained by the proposed method appear
to be a smoothed collection of locally desired boundaries obtained by
individual classifiers.


Real  World  Data 

We also applied our combination method to a handwritten digit recognition
problem (a 10-class problem) as a real-world case. The training and test data
each consisted of 200 points per class. Each data point was a 16-dimensional
real vector. In this experiment, we used three neural networks with different
weight decay parameters ($\lambda$ = 0.6, 0.3, 0.15). The number of hidden
units was H = 20.

Table 3 shows the misclassification error obtained by each network. As is
well known, when $\lambda$ was too large (small), the obtained class
boundaries under- (over-) fitted the training data. Performing an exhaustive
line search, we found that $\lambda$ = 0.3 produced the smallest test error.

Figure 1: Classification results of neural network classifiers. H is the
number of hidden units; o and + are class 1 and class 2 sample points,
respectively. The solid lines denote class boundaries.

Table 1: Classification performance of individual neural network classifiers.
H is the number of hidden units.

                      H = 3     H = 9     H = 15
Training error (%)    13.0      11.0      13.0
Test error (%)        17.0      23.0      22.9

Figure  2:  Results  of  combined  classifiers  by  simple  averaging  (Method  1),  the 
MSE  criterion  (Method  2),  and  the  MCE  criterion  (Method  3). 

Table 2: Classification performance of three combining methods: simple
averaging (Method 1), stacked regression (Method 2), and the proposed method
(Method 3).

                      Method 1    Method 2    Method 3
Training error (%)    12.0        12.0        12.0
Test error (%)        18.3        20.6        16.0


Table 3: Classification performance of individual neural network classifiers.
$\lambda$ is the weight decay parameter.

                      $\lambda$ = 0.6    $\lambda$ = 0.3    $\lambda$ = 0.15
Training error (%)    5.5                3.9                3.0
Test error (%)        8.8                8.1                8.6

Table 4: Classification performance of combined classifiers.

                      MSE       MCE
Training error (%)    2.7       3.4
Test error (%)        8.9       7.6

The results for combined classifiers are compared in Table 4. One can see
that the test error obtained by the MCE criterion was lower than that
obtained by the MSE criterion. Moreover, the combined classifier using the
MCE criterion outperformed the best single classifier ($\lambda$ = 0.3).


CONCLUSION 

We have presented a new way of linearly combining discriminant-based
classifiers using the minimum classification error discriminant. One
important point is to reformulate the weight estimation problem in
combination as the design problem of a linear classifier in an f-space. This
also made us understand why the stacked regression (MSE) approach is not
suitable for the minimum classification error objective, and motivated us to
devise a new approach suited to that objective. In our approach, as in the
stacked regression approach, cross-validated stacked data is used to estimate
the linear weights. In this sense, we may call our approach stacked
classification.
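
The cross-validated stacked data mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `stacked_outputs` and `midpoint_trainer`, the interleaved fold assignment, and the toy one-dimensional learner are all hypothetical and stand in for the neural network classifiers used in the paper. The point shown is that every row of the stacked data is produced by models that never saw that sample during training.

```python
from typing import Callable, List, Sequence

Model = Callable[[float], float]  # maps an input to a discriminant value
Trainer = Callable[[Sequence[float], Sequence[int]], Model]


def stacked_outputs(xs: Sequence[float], ys: Sequence[int],
                    trainers: Sequence[Trainer], n_folds: int) -> List[List[float]]:
    """Row n holds each model's output for sample n, where every model was
    trained without the fold that contains sample n."""
    rows = [[0.0] * len(trainers) for _ in xs]
    for fold in range(n_folds):
        held_out = [n for n in range(len(xs)) if n % n_folds == fold]
        kept = [n for n in range(len(xs)) if n % n_folds != fold]
        for m, train in enumerate(trainers):
            model = train([xs[n] for n in kept], [ys[n] for n in kept])
            for n in held_out:
                rows[n][m] = model(xs[n])
    return rows


def midpoint_trainer(scale: float) -> Trainer:
    """Toy two-class learner: scaled distance from the midpoint of the class means."""
    def train(x: Sequence[float], y: Sequence[int]) -> Model:
        class0 = [v for v, c in zip(x, y) if c == 0]
        class1 = [v for v, c in zip(x, y) if c == 1]
        mid = 0.5 * (sum(class0) / len(class0) + sum(class1) / len(class1))
        return lambda v: scale * (v - mid)
    return train


xs = [0.0, 0.2, 0.4, 0.6, 1.0, 1.2, 1.4, 1.6]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
stacked = stacked_outputs(xs, ys, [midpoint_trainer(1.0), midpoint_trainer(2.0)], n_folds=4)
```

The rows of `stacked`, paired with the labels `ys`, are the data on which the combination weights would then be estimated, by least squares in stacked regression or by a classification-error criterion in the approach described here.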

In our experiments using combined neural network classifiers, we confirmed
the potential advantage of our method. Some of the future research issues
involve classifier selection. For neural network classifiers, it would be
better to combine networks with different weight decay parameters than those
with different numbers of hidden units. However, how to select the weight
decay parameters for the combination is left unsettled in this paper; we
selected them in an ad hoc manner. This practically important problem remains
as a future task.


The paper presents a pruning scheme for the hierarchical mixtures of experts
(HME), which is a hierarchical, tree-like modular neural network trained
using the EM-algorithm. The pruning scheme is in the style of CART's pruning
scheme, and consists of using cross-entropy to select and cut out sub-trees
of the HME to create a series of nested HMEs. The right-sized HME can then be
selected by using cross-validation. Experiments are carried out to
demonstrate the successful operation of the scheme.


1.  INTRODUCTION 

The class of modular networks called the hierarchical mixtures of experts
(HME) was first introduced by [1]. The HME is a divide-and-conquer method
that suffers neither from the slow learning of single neural networks nor
from the large bias of fast-learning tree methods like the Classification and
Regression Tree (CART) [2]. This paper is concerned with determining the
correct size of the HME, which is crucial because it governs the
bias-variance trade-off [3]. Breiman et al. [2] apply a pruning scheme to the
CART after it has been grown to an excessively large size, and thereby
determine the structure of the tree. In this paper the same idea is applied
to the HME to determine its structure, rather than using regularisation [4]
or stopping training early to control the complexity [1]. Although [4]
developed a pruning scheme for the HME, it uses a threshold to select the
sub-tree, and data can be accumulated down the sub-tree at a later time step.
The method presented here does not rely on a threshold and permanently prunes
the sub-trees of the HME.

The paper begins by introducing the HME and its training in section 2. In
section 3 the pruning and merging scheme is explained, and the results
obtained are presented in section 4. Finally, conclusions and discussion are
given in section 5.

Figure 1 The Hierarchical Mixtures of Experts


2.  HIERARCHICAL  MIXTURES  OF  EXPERTS  (HME) 


The HME is a modular neural network, which consists of several expert
networks assigned to different parts of a task by a hierarchy of gating
networks (Figure 1). The gating networks weight the outputs of the expert
networks, $\hat{y}_k^{(t)}$, to produce a prediction of the output given by:

    \hat{y}^{(t)} = \sum_i g_i^{(t)} \sum_k g_{k|i}^{(t)} \hat{y}_k^{(t)}        (1)

where the sums run over the branches of each non-terminal node, $\theta_k$
and $\eta_i$ are the parameters of the expert and gating networks, and
$\hat{y}_k^{(t)}$ and $g_i^{(t)}$ are notational shorthand for
$\hat{y}_k[x^{(t)}]$ and $g_i[x^{(t)}]$ respectively, i.e., the values of
$\hat{y}_k$ and $g_i$ at the individual samples. $g_i^{(t)}$ is the gating
factor, i.e., the output of the gating network after it has been passed
through the binomial logistic function:

    g_i^{(t)} = \frac{\exp(s_i^{(t)})}{\sum_{n=1}^{N} \exp(s_n^{(t)})}        (2)

where $s_i^{(t)}$ is the ith output of the gating network and the sum runs
over all N of its outputs.
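
A minimal sketch of the forward pass described above, assuming the gating function is the normalised exponential of the gating outputs and the tree has two levels. The function names `gating` and `hme_output` and the numerical values are illustrative assumptions, not part of the paper.

```python
import math
from typing import List, Sequence


def gating(s: Sequence[float]) -> List[float]:
    """Normalised exponential of the gating network outputs s_1..s_N."""
    top = max(s)  # subtracting the maximum avoids overflow; the result is unchanged
    e = [math.exp(v - top) for v in s]
    total = sum(e)
    return [v / total for v in e]


def hme_output(s_top: Sequence[float], s_low: Sequence[Sequence[float]],
               y_experts: Sequence[Sequence[float]]) -> float:
    """Two-level prediction: sum_i g_i * sum_k g_{k|i} * y_{k|i}."""
    g_top = gating(s_top)
    out = 0.0
    for i, g_i in enumerate(g_top):
        g_low = gating(s_low[i])
        out += g_i * sum(g * y for g, y in zip(g_low, y_experts[i]))
    return out


g = gating([0.0, 0.0])
y_hat = hme_output([0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]], [[1.0, 3.0], [5.0, 7.0]])
```

With all gating outputs equal, every expert receives the same weight and the prediction reduces to the plain average of the four expert outputs.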

To train the HME, the EM-algorithm is used [5]; it is an iterative
optimisation method that ideally suits the modular structure of the HME. The
EM-algorithm for the HME works by iterating between estimating the posterior
probabilities, $h_i^{(t)}$, of the gating factors, and training the expert
and gating networks. The posterior probabilities for the lowest level gating
networks just above the expert networks in a two-level HME are:

    h_{k|i}^{(t)} = \frac{g_{k|i}^{(t)} P_k^{(t)}}{\sum_k g_{k|i}^{(t)} P_k^{(t)}}        (3)

and the posterior probabilities in the top-level gating network:

    h_i^{(t)} = \frac{g_i^{(t)} \sum_k g_{k|i}^{(t)} P_k^{(t)}}{\sum_i g_i^{(t)} \sum_k g_{k|i}^{(t)} P_k^{(t)}}        (4)

$P_k$ is the conditional likelihood for each expert network:

    P_k^{(t)} = (2\pi)^{-d/2} |R_k|^{-1/2} \exp\left( -\tfrac{1}{2} (y^{(t)} - \hat{y}_k^{(t)})^T R_k^{-1} (y^{(t)} - \hat{y}_k^{(t)}) \right)        (5)

where y is the target output, $R_k$ the covariance matrix for the kth expert,
d its dimension, and $\hat{y}_k$ is the estimation of the output by the kth
expert.
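
The E-step for a single sample can be sketched as below, assuming a two-level HME, scalar targets (d = 1) and Gaussian expert likelihoods. The function names and the numbers are illustrative assumptions; the posteriors are simply the gating factors reweighted by the expert likelihoods and renormalised.

```python
import math
from typing import List, Sequence, Tuple


def gaussian_likelihood(y: float, y_hat: float, sigma: float) -> float:
    """Scalar (d = 1) Gaussian likelihood of target y under one expert."""
    z = (y - y_hat) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def posteriors(g_top: Sequence[float], g_low: Sequence[Sequence[float]],
               lik: Sequence[Sequence[float]]) -> Tuple[List[float], List[List[float]]]:
    """Return (h_top, h_low) for one sample.

    g_top[i]    : top-level gating factor of branch i
    g_low[i][k] : lower-level gating factor of expert k under branch i
    lik[i][k]   : likelihood of the target under that expert
    """
    n = len(g_top)
    branch = [sum(g * p for g, p in zip(g_low[i], lik[i])) for i in range(n)]
    total = sum(g_i * b for g_i, b in zip(g_top, branch))
    h_top = [g_i * b / total for g_i, b in zip(g_top, branch)]
    h_low = [[g * p / branch[i] for g, p in zip(g_low[i], lik[i])] for i in range(n)]
    return h_top, h_low


h_top, h_low = posteriors([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]],
                          [[0.4, 0.2], [0.1, 0.1]])
```

Here the first branch explains the target better, so its posterior rises from the prior gating value of 0.5 to 0.75.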

For training the linear expert networks within the iteration of the
EM-algorithm, weighted-least-squares is used:

    w_j = (X H X^T)^{-1} X H Y^T        (6)

where X is the matrix of explanatory variables:

    X = \begin{bmatrix}
          x(t-1) & x(t-2)   & \cdots & x(t-1-T) \\
          x(t-2) & x(t-3)   & \cdots & x(t-T-2) \\
          \vdots & \vdots   &        & \vdots   \\
          x(t-n) & x(t-n-1) & \cdots & x(t-n-T)
        \end{bmatrix}        (7)

and Y is the matrix of regressands:

    Y = [\, y(t) \ \cdots \ y(t-T) \,]        (8)

and H is the diagonal matrix of posterior probabilities across time, divided
by the standard deviation:

    H = \mathrm{diag}\left( \frac{h_{k|i}^{(t)}}{\sigma_k}, \frac{h_{k|i}^{(t-1)}}{\sigma_k}, \ldots, \frac{h_{k|i}^{(t-T)}}{\sigma_k} \right)        (9)

However, a non-linear optimisation method like Iteratively Reweighted Least
Squares (IRLS) [1] is used for training the gating networks, even when the
gating network itself is linear, because of the binomial logistic function.
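
The layout of the matrices in (7) and (8) can be illustrated with a short sketch. It assumes one-step-ahead prediction of a single series, so that the regressand y is the series x itself; the function name `lag_matrices` and the argument names are hypothetical. Rows of X are lags and columns are time samples, with column c corresponding to time t - c.

```python
from typing import List, Sequence, Tuple


def lag_matrices(x: Sequence[float], n_lags: int,
                 n_cols: int) -> Tuple[List[List[float]], List[float]]:
    """Build X (n_lags rows, n_cols columns) and Y for one-step-ahead prediction.

    t is the last index of x. Column c refers to time t - c, and row r of X
    holds the value r + 1 steps before that time.
    """
    t = len(x) - 1
    if t - n_lags - (n_cols - 1) < 0:
        raise ValueError("series too short for the requested lags and columns")
    X = [[x[t - (r + 1) - c] for c in range(n_cols)] for r in range(n_lags)]
    Y = [x[t - c] for c in range(n_cols)]
    return X, Y


series = [float(v) for v in range(10)]  # x(0) ... x(9), so t = 9
X, Y = lag_matrices(series, n_lags=3, n_cols=4)
```

In the notation of the text, `n_lags` plays the role of n and `n_cols` the role of T + 1.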


3.  PRUNING  SCHEME 

The  pruning  scheme  proposed  consists  of  training  a  large  HME  until 
convergence,  pruning  out  the  worst  sub-tree  in  the  HME,  and  then  reiterating  the 
training-pruning  cycle  until  there  is  only  one  expert  left  on  the  HME.  The  HME 
with  the  smallest  validation  error  is  then  selected  out  of  the  series  of  HMEs 
created  by  the  pruning. 

The criterion for selecting the worst sub-tree is that the root gating
network of the sub-tree has the largest cross-entropy:

    CE = -\sum_t \left[ h_i^{(t)} \ln g_i^{(t)} + (1 - h_i^{(t)}) \ln(1 - g_i^{(t)}) \right]        (10)

The reason for using cross-entropy is that it is a combination of the entropy
of the posterior and the Kullback-Leibler number [6]:

    CE = -\sum_t \left[ h_i^{(t)} \ln h_i^{(t)} + (1 - h_i^{(t)}) \ln(1 - h_i^{(t)}) \right]
         + \sum_t \left[ h_i^{(t)} \ln \frac{h_i^{(t)}}{g_i^{(t)}} + (1 - h_i^{(t)}) \ln \frac{1 - h_i^{(t)}}{1 - g_i^{(t)}} \right]        (11)

The Kullback-Leibler number is a measure of how close $g_i^{(t)}$ is to
$h_i^{(t)}$ over all the samples, and is zero when they are identical. An
alternative view is that it represents the amount of information loss that
has occurred in the gating network's attempt to learn the posterior. The
entropy measures how sharply the posterior, $h_i^{(t)}$, of the gating
network divides the input space. When the gating network is selected for
pruning, it is chosen because it splits up the input space the most smoothly,
but has also captured the posterior information.
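
The decomposition of the cross-entropy into an entropy term and a Kullback-Leibler term can be checked numerically. The sketch below assumes a binary gate, so that each sample contributes a pair (h, 1 - h) and (g, 1 - g), and assumes g lies strictly between 0 and 1; the function names and the sample values are illustrative.

```python
import math
from typing import Sequence


def _xlogy(p: float, q: float) -> float:
    """p * ln(q), with the convention that the term vanishes when p is 0."""
    return 0.0 if p == 0.0 else p * math.log(q)


def cross_entropy(h: Sequence[float], g: Sequence[float]) -> float:
    """Cross-entropy between posteriors h and gating factors g of a binary gate."""
    return -sum(_xlogy(a, b) + _xlogy(1.0 - a, 1.0 - b) for a, b in zip(h, g))


def entropy(h: Sequence[float]) -> float:
    """Entropy of the posterior, i.e. the cross-entropy of h with itself."""
    return cross_entropy(h, h)


def kl_number(h: Sequence[float], g: Sequence[float]) -> float:
    """Kullback-Leibler number between h and g, summed over the samples."""
    return sum(
        _xlogy(a, a / b) + _xlogy(1.0 - a, (1.0 - a) / (1.0 - b))
        for a, b in zip(h, g)
    )


h = [0.9, 0.2, 0.5]
g = [0.8, 0.3, 0.5]
gap = cross_entropy(h, g) - (entropy(h) + kl_number(h, g))
```

A gate whose posterior is close to 0.5 everywhere has a large entropy term, and a gate that tracks its posterior poorly has a large Kullback-Leibler term; either raises the cross-entropy used as the pruning criterion.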


Once the worst sub-tree has been selected, it can be pruned out and a new
expert network added in its place. The simplest way to initialise the new
expert is to generate its weights randomly, but this can cause instability in
the training after the pruning. A better method is merging, i.e., taking the
weighted average of the experts' weights in the sub-tree, found by feeding
the weights of the experts into the sub-tree instead of the inputs, and
averaging over all samples:

    w_{new} = \frac{1}{T} \sum_{t=1}^{T} \left[ \sum_{i} g_i^{(t)} \sum_{k} g_{k|i}^{(t)} w_{k|i} \right]        (12)

Initialisation  of  variance  is  also  important  in  the  merging  scheme,  and  all  experts’ 
standard  deviations  should  be  reinitialised  to  a  large  value  so  that  they  can 
compete  fairly  after  merging. 

The problem with the merging scheme is that it cannot be performed if the
experts are non-linear. Even when the experts are linear, it does not produce
an expert that is a mean-squares fit to the data in its local region, so
further training of the whole HME still has to be performed after the
merging.

The freezing method overcomes the problems of merging by freezing all the
weights and standard deviations in the HME, except for the new expert's
weights and standard deviation. The new expert's weights and standard
deviation are then trained iteratively, alternating between the expert's
weights and its standard deviation as in weighted regression. If the experts
are linear, the weighted regression becomes an iterated
weighted-least-squares, and the weights are obtained by:

    w_{new} = (X G X^T)^{-1} X G Y^T        (13)


where  X  and  Y  are  defined  as  in  (7)  and  (8)  and  G  is  the  weighting  matrix: 


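    G = \mathrm{diag}\left( \frac{g_{k,i}^{(t)}}{\sigma_k}, \frac{g_{k,i}^{(t-1)}}{\sigma_k}, \ldots, \frac{g_{k,i}^{(t-T)}}{\sigma_k} \right)        (14)

and $g_{k,i}$ is the multiplication of all the conditional gating factors
down the path to the expert network:

    g_{k,\ldots,i} = g_i \, g_{l|i} \cdots g_{k|\ldots}        (15)

The standard deviation, $\sigma_k$, is:

    \sigma_k = \sqrt{ \frac{\sum_{t} g_{k,i}^{(t)} (y^{(t)} - \hat{y}_k^{(t)})^2}{\sum_{t} g_{k,i}^{(t)}} }        (16)
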

The gating factor, g, is used to weight the least-squares to make sure that
the new expert is only fitted in the local region covered by the previous two
experts that have been replaced. The standard deviation has to be estimated
iteratively in the weighted regression because it is unknown a priori, and is
not necessarily the same as that of the previous two experts.

After  pruning,  the  optimal  size  of  tree  is  then  selected  by  validation.  This 
validation  can  be  on  a  single  validation  set  using  one  series  of  pruned  HMEs. 
Alternatively,  it  can  be  done  by  selecting  the  majority-optimal  sized  HME  from  n 
separate  series  of  pruned  HMEs  trained  on  n  separate  cross-validation  sets  (n-fold 
cross-validation).  The  majority-optimal-sized  HME  is  the  HME  that  minimises  its 
cross-validation  error  the  most  times. 
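
The majority-optimal selection rule can be sketched as follows. The data layout (one dictionary of validation errors per fold, keyed by the number of experts) and the tie-break towards the smaller HME are assumptions made for this illustration; the text does not say how ties are resolved.

```python
from collections import Counter
from typing import Dict, List


def majority_optimal_size(fold_errors: List[Dict[int, float]]) -> int:
    """fold_errors[f][size] is the validation error on fold f of the pruned
    HME with `size` experts. Returns the size that wins the most folds."""
    winners = [min(errs, key=errs.get) for errs in fold_errors]
    counts = Counter(winners)
    best = max(counts.values())
    # Assumed tie-break: prefer the smaller HME.
    return min(size for size, c in counts.items() if c == best)


folds = [
    {1: 0.30, 2: 0.20, 3: 0.25},
    {1: 0.28, 2: 0.22, 3: 0.21},
    {1: 0.31, 2: 0.19, 3: 0.24},
]
chosen = majority_optimal_size(folds)
```

Counting wins per fold, rather than averaging the errors, is what makes a table of selection counts such as Table 1 the natural summary of a trial.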


4.  EXPERIMENTAL  RESULTS 

The experiments use the Sunspot time-series (a benchmark time-series), which
records the yearly activity of the sun from 1700 to 1979 and was tackled by
[7] using neural networks. The result of 10-fold cross-validation for each
trial is shown in a separate column of Table 1. Each trial shows that the
two-expert HME predominates. Trials two and three use the same initial
weights for the HMEs, so that the freezing and merging methods can be
compared; the freezing method is more consistent. 10-fold cross-validation
was used rather than 5-fold cross-validation, which was found to be
inconsistent. The gating and expert networks are all linear, with lag vectors
of 12 steps as the input. A lag vector of 12 steps is suitable because the
periodicity of the time-series is approximately 12 years, and because it
allows comparison with other benchmark results. The HME was used to generate
one-step-ahead predictions, which were scored by the normalised mean-squared
error (NMSE). NMSE, as defined by [7], is the mean-square error normalised by
the variance of the whole of the sunspot data (1700-1979).
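
A minimal sketch of the NMSE as described above: the mean-square prediction error on some evaluation set, divided by a reference variance computed from the whole series. The function names are illustrative, and the use of the population variance (dividing by the number of points) is an assumption.

```python
from typing import Sequence


def variance(values: Sequence[float]) -> float:
    """Population variance of a series."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def nmse(targets: Sequence[float], predictions: Sequence[float],
         reference_variance: float) -> float:
    """Mean-square error of the predictions, normalised by a reference variance."""
    mse = sum((y - p) ** 2 for y, p in zip(targets, predictions)) / len(targets)
    return mse / reference_variance


whole_series = [1.0, 2.0, 3.0, 4.0]
score = nmse([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0], variance(whole_series))
```

Because the same reference variance is used for every evaluation set, NMSE values on the training set and on the two test periods are directly comparable; a predictor that always outputs the series mean scores about 1 on the whole series.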


TABLE 1

                     1st trial    2nd trial    2nd trial
                     merging      merging      freezing
2 Expert Networks    4            3            4
3 Expert Networks    3            3            1


The  number  of  times  different  sized  networks  are  selected  by  the  cross- 
validation  and  pruning  in  three  separate  cross-validation  trials.  Each 
trial  represents  10  pruning-training  cycles  trained  by  10-fold  cross- 
validation.  The  figures  quoted  are  the  number  of  times  that  a  particular 
size  of  HME  minimises  its  cross-validation  error  in  a  trial.  Results  for 
any  HME  larger  than  3  experts  in  size  are  not  illustrated  in  the  table, 
because  the  pruning  and  cross-validation  never  selects  these  sizes  more 
than once in a given trial.


To compare the performance of the different HME structures, the HMEs produced
by the pruning-scheme were then trained on the whole of the training data,
and their error-rates compared on the training and two validation sets. In
Figure 2, the average, minimum and maximum NMSE are all worst for the 2-gate
HME. However, the 2-gate HME performs the best on the two test sets
(Figure 3), demonstrating that the pruning-scheme has selected the best size
of HME.


Figure 2 The NMSE of the different sized HMEs on the training set (sunspot
1700-1920): the average NMSE, and the maximum and minimum NMSE.


Figure 3 The NMSE of different sized HMEs on the test sets (a) 1921-1955 (b)
1956-79: the average NMSE, and the minimum and maximum NMSE.


The NMSE for our pruning scheme is lower on the (1956-79) test set than that
of the other methods (MLP, TAR etc.), and nearly equal to its own result on
the (1921-1955) test set (Table 2). This is despite its NMSE on the training
set and the first test set (1921-55) being nearly twice as large as that of
the other methods. The pruned HME developed here generalises better across
both test sets.


TABLE 2

Method of        Training set    Test Set 1      Test Set 2
Training         NMSE            (1921-1955)     (1956-1979)
                                 NMSE            NMSE
MLP              0.0862          0.086           0.35
TAR              0.097           0.097           0.28
HME              0.061           0.089           0.27
Linear           0.0984          0.184           0.366
HME pruning      0.117           0.190           0.208

The results of the HME pruning, when compared with other results trained and
tested on the same data: the MLP, TAR and Waterhouse's regularised HME (with
error-bars). The results are given in the mean-square error normalised by the
variance from the whole data (NMSE).


5.  DISCUSSION  AND  CONCLUSION 

The pruning scheme provides the best-generalising HME with regard to NMSE on
the whole Sunspot data. It saves some of the computational burden that would
be required if each size and structure of HME were selected and trained
separately. This saving holds even though the pruning-scheme itself uses
cross-validation, with its high computational burden, because the pruning
process eliminates many of the possible HME structures that straight
cross-validation without pruning would have to evaluate. Of the two methods
of initialising expert networks, the freezing method is superior to the
merging method, as it provides more consistent results.

Pruning is also a complement to the growing method. In the CART, the
algorithm grows the tree, then prunes back, so that the input space is
divided as far as possible, and then the weakest branches can be pruned off.
In this paper the growing idea has been approximated by having a
larger-than-necessary HME, but a growing method to complement the pruning
method is currently being investigated.


Blind source separation is the problem of recovering independent signals from
their mixtures when only the mixtures are observed. In its simplest form, the
data model is x=As where s is a vector of independent sources, matrix A is
the unknown 'mixing matrix' and x is the vector of observations: the
'mixtures'. It is fair to say that this problem is now well understood. In
this presentation we consider the problem of recovering the source signals
from (noisy) observations. This is modeled as x=As+n where vector n
represents an additive noise, independent from the signals. There are many
issues related to the noisy problem, which make it significantly different
from the noise-free problem. This talk reviews what is known about the noisy
case and presents new results. The following points are addressed:

-  What  is  the  optimal  filter  to  recover  unknown  sources  from  noisy  observations? 

-  How  to  use  high  order  information  to  define  contrast  functions  which  remain 
consistent  in  the  presence  of  noise. 

-  What  does  the  likelihood  have  to  say  in  a  noisy  context? 

-  Learning  noisy  mixtures  with  the  EM  (Expectation-Maximization)  algorithm 
and  its  stochastic  variants 

-  The  tricky  continuity  between  noisy  and  noise-free  optimal  algorithms. 

-  Achievable  performance  in  the  presence  of  noise. 

-  When  is  a  noise-free  algorithm  appropriate  to  deal  with  noisy  observations? 

-  Is it really worth caring about observation noise?


ABSTRACT

The author previously introduced a large family of one-unit contrast
functions to be used in independent component analysis (ICA). In this paper,
the family is analyzed mathematically in the case of a finite sample. Two
aspects of the estimators obtained using such contrast functions are
considered: asymptotic variance, and robustness against outliers. An
expression for the contrast function that minimizes the asymptotic variance
is obtained as a function of the probability densities of the independent
components. Combined with robustness considerations, these results provide
strong arguments in favor of the use of contrast functions based on slowly
growing functions, and against the use of kurtosis, which is the classical
contrast function.

1.  INTRODUCTION 

Independent Component Analysis (ICA) [1] is a statistical signal processing
technique whose main applications are blind source separation, blind
deconvolution, and feature extraction. In the simplest form of ICA [2], one
observes m scalar random variables $x_1, x_2, \ldots, x_m$ which are assumed
to be linear combinations of n unknown independent components, or ICs,
denoted by $s_1, s_2, \ldots, s_n$. These ICs $s_i$ are assumed to be
mutually statistically independent and zero-mean. Arranging the observed
variables $x_j$ into a vector $x = (x_1, x_2, \ldots, x_m)^T$ and the IC
variables $s_i$ into a vector s, the linear relationship can be expressed as

    x = As        (1)

Here, A is an unknown m x n matrix of full column rank, called the mixing
matrix. The basic problem of ICA is then to estimate both the mixing matrix
A and the realizations of the ICs $s_i$ using only observations of the
mixtures $x_j$.

Estimation of ICA requires the use of higher-order information, i.e., other
information than that contained in the covariance matrix of x. This
higher-order information is usually incorporated in the estimation procedures
by means of 'contrast' functions based on higher-order cumulants [2, 3].
However, little justification has been provided in the literature for the
choice of using higher-order cumulants for the construction of the contrast
functions. The main reason for their popularity seems to be that they are
easy to analyze mathematically. No statistical or practical arguments in
favor of cumulants have been put forth, except for the fact that they may be
more resistant to Gaussian noise, because the higher-order cumulants of
Gaussian noise vanish.

In this paper, we analyze mathematically a large family of one-unit contrast
functions introduced in [4]. The asymptotic variance of the obtained
estimators is evaluated, and it is shown that for super-Gaussian ICs, the
asymptotic variance is minimized for contrast functions that grow much more
slowly than the 4-th power inherent in the fourth-order cumulant or kurtosis.
Furthermore, robustness against outliers also requires slowly growing
contrast functions. As most ICs encountered in practice seem to be
super-Gaussian, this means that kurtosis may be a rather inadequate contrast
function in most cases. For neural learning rules, the results imply that
better estimates are usually obtained using (anti-)Hebbian learning functions
that are sigmoidal, or even go to zero at infinity. Simulations back up our
theoretical arguments.

2.  GENERAL  ONE-UNIT  CONTRAST  FUNCTIONS 

Consider a linear combination of the observed mixtures $x_j$, say $w^T x$,
where the (weight) vector w is constrained so that $E\{(w^T x)^2\} = 1$. Many
ICA algorithms are based on finding the extrema of the square of the kurtosis
$\mathrm{kurt}^2(w^T x) = (E\{(w^T x)^4\} - 3)^2$ of such a linear
combination [2, 3]. This can be motivated by information-theoretic arguments:
the square of the kurtosis can be shown to approximate the negentropy of
$w^T x$ [2]. Moreover, it can be proven that the square of the kurtosis of
$w^T x$ is maximized exactly in the points where the linear combination
equals, up to the sign, one of the ICs, i.e., $w^T x = \pm s_i$ for some i
[3, 5].

This approach was generalized in [4, 6, 7], where it was shown that instead
of kurtosis, practically any non-quadratic, well-behaving even function, say
G, can be used to construct a contrast function for ICA. Such a general
contrast function can be defined as

    J_G(w) = [E_x\{G(w^T x)\} - E_\nu\{G(\nu)\}]^2        (2)

where $\nu$ is a standardized Gaussian variable. The second term in brackets
is a normalization constant that makes $J_G$ equal to zero if $w^T x$ has a
Gaussian distribution. Clearly, $J_G$ can be considered a generalization of
the square of kurtosis, as for $G(u) = u^4$, $J_G$ becomes simply the square
of kurtosis of $w^T x$. It was shown in [6], using a generalization of the
Gram-Charlier expansion, how $J_G$ approximates the negentropy of $w^T x$ in
the same way as the square of the kurtosis. Furthermore, it was shown in [7]
that under weak assumptions, $J_G$ is locally maximized when
$w^T x = \pm s_i$, i.e. when the linear combination equals one of the ICs.
Therefore, $J_G$ can be used as a contrast function for ICA in the same way
as the square of the kurtosis. Note that for simplicity, we shall also refer
to G as a contrast function.
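
The sample version of such a contrast function can be sketched as below, assuming the projections $y = w^T x$ have already been computed and have unit variance. The function names, the trapezoidal quadrature used for the Gaussian expectation, and the toy sample are illustrative assumptions. For $G(u) = u^4$ the Gaussian expectation is 3, so the contrast is the squared sample kurtosis.

```python
import math
from typing import Callable, Sequence


def gaussian_expectation(G: Callable[[float], float],
                         half_width: float = 8.0, steps: int = 4000) -> float:
    """Trapezoidal estimate of E{G(nu)} for a standardized Gaussian nu."""
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        u = -half_width + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * G(u) * math.exp(-0.5 * u * u)
    return total * h / math.sqrt(2.0 * math.pi)


def contrast(y: Sequence[float], G: Callable[[float], float],
             gauss_mean: float) -> float:
    """Sample version of J_G for unit-variance projections y = w^T x."""
    sample_mean = sum(G(v) for v in y) / len(y)
    return (sample_mean - gauss_mean) ** 2


def fourth_power(u: float) -> float:
    return u ** 4


y = [-1.0, 1.0, -1.0, 1.0]  # zero mean, unit variance, strongly sub-Gaussian
j_kurt = contrast(y, fourth_power, 3.0)
j_quad = contrast(y, fourth_power, gaussian_expectation(fourth_power))
```

Any other even, non-quadratic G can be passed in the same way, with `gaussian_expectation` supplying the normalization constant.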

Thus, we estimate one IC by solving the following optimization problem:

    w = \arg\max_{E\{(w^T x)^2\} = 1} J_G(w)        (3)

where in practice the expectations are replaced by sample averages. Note that
this boils down to maximizing or minimizing $E_x\{G(w^T x)\}$, where the type
of extrema searched for depends on the sign of
$E_x\{G(w^T x)\} - E_\nu\{G(\nu)\}$.

To  estimate  all  the  ICs,  one  needs  only  to  find  all  the  local  solutions  of  this 
optimization  problem.  We  shall  not  consider  here  in  detail  how  to  solve  this 
optimization.  Two  simple  methods  are  possible.  First,  one  can  use  a  simple 
gradient  descent/ascent  with  a  decreasing  learning  rate,  as  is  considered  in 
more  detail  in  [7].  In  that  case  it  may  be  useful  to  first  whiten  (or  sphere) 
the  data,  which  simplifies  the  constraint  to  ||w||  =  1.  A  second  possibility 
is  the  fixed-point  algorithm  introduced  for  kurtosis  in  [8]  and  generalized  for 
any  G  in  [4].  However,  the  statistical  properties  of  the  estimator  defined  in 
(3)  do  not  depend  on  the  method  of  optimization. 

In  the  following,  we  shall  analyze  two  fundamental  statistical  properties  of 
w,  which  are  asymptotic  variance  and  robustness.  Though  in  principle  almost 
any  non-quadratic  even  function  G  can  be  used,  in  practice  the  performance 
of  different  contrast  functions  may  be  very  different  due  to  limited  sample 
sizes  and  deviations  from  the  model  (1).  Therefore,  some  analysis  is  needed 
to  provide  guidelines  on  how  to  choose  the  function  G  to  obtain  a  statistically 
adequate  estimator. 


3.  ASYMPTOTIC  VARIANCE 


In practice, one usually has only a finite sample of N observations of the
vector x. Therefore, the expectations in the definition of $J_G$ are in fact
replaced by sample averages. This results in certain errors in the estimator
w, and it is desired to make these errors as small as possible. A classical
measure of this error is asymptotic (co)variance, which means the limit of
the covariance matrix of $w\sqrt{N}$ as $N \to \infty$. This gives an
approximation of the mean-square error of w. Comparison of, say, the traces
of the asymptotic variances of two estimators enables direct comparison of
the accuracy of two estimators. One can solve analytically for the asymptotic
variance of w, obtaining the following theorem:

Theorem 1 The trace of the asymptotic variance of w as defined in (3) for
the estimation of the independent component $s_i$, equals

    V_G = C(A) \frac{E\{g^2(s_i)\} - (E\{s_i g(s_i)\})^2}{(E\{s_i g(s_i) - g'(s_i)\})^2}        (4)

where g is the derivative of G, and C(A) is a constant that depends only on
A.

Proof: Making the change of variable $z = A^T w$, the equation defining the
optimal solutions z becomes

    \sum_t s_t \, g(z^T s_t) = \lambda \sum_t s_t s_t^T z        (5)

where t = 1, ..., T is the sample index, T is the sample size, and $\lambda$
is a Lagrangian multiplier. Without loss of generality, let us assume that z
is near the ideal solution z = (1, 0, 0, ...). Note that due to the
constraint $E\{(w^T x)^2\} = \|z\|^2 = 1$, the variance of the first
component of z, denoted by $z_1$, is of a smaller order than the variance of
the vector of other components, denoted by $z_{-1}$. Excluding the first
component in (5), and making the first-order approximation
$g(z^T s) = g(s_1) + g'(s_1) z_{-1}^T s_{-1}$, where also $s_{-1}$ denotes s
without its first component, one obtains after some simple manipulations

    \frac{1}{\sqrt{T}} \sum_t s_{-1} [g(s_1) - \lambda s_1] = \frac{1}{T} \sum_t s_{-1} [-s_{-1}^T g'(s_1) + \lambda s_{-1}^T] z_{-1} \sqrt{T}        (6)

where the sample index t has been dropped for simplicity. Making the
first-order approximation $\lambda = E\{s_1 g(s_1)\}$, one can write (6) in
the form $u = v z_{-1} \sqrt{T}$ where v converges to the identity matrix
multiplied by $E\{s_1 g(s_1)\} - E\{g'(s_1)\}$, and u converges to a variable
that has a normal distribution of zero mean whose covariance matrix equals
the identity matrix multiplied by $E\{g^2(s_1)\} - (E\{s_1 g(s_1)\})^2$. This
implies the theorem, since $z_{-1} = Bw$, where B is the inverse of $A^T$
without its first row.

Thus the comparison of the asymptotic variances of two estimators of the form
in (3), but for two different contrast functions G, boils down to a
comparison of the $V_G$'s. In particular, one can use variational calculus to
find a G that minimises $V_G$. Thus one obtains the following theorem:

Theorem 2 The trace of the asymptotic variance of w is minimized when G is of
the form

    G_{opt}(u) = c_1 \log f(u) + c_2 u^2 + c_3        (7)

where f is the density function of $s_i$, and $c_1, c_2, c_3$ are arbitrary
constants.

For simplicity, one can choose $G_{opt}(u) = \log f(u)$. Thus one sees that
the optimal contrast function is the same as the one obtained for several
units by the maximum likelihood approach [9], or the infomax approach [10].
Almost identical results have also been obtained in [11] for another
multi-unit algorithm. Our results treat, however, the one-unit case instead
of the multi-unit case, and are thus applicable to estimation of a subset of
the ICs, and to blind deconvolution [7].
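
As a worked illustration of Theorem 2, consider two unit-variance densities chosen here purely as examples (they are assumptions of this sketch, not cases taken from the paper): the Laplacian density, which is super-Gaussian, and the Gaussian density.

```latex
% Unit-variance Laplacian density:
f(u) = \frac{1}{\sqrt{2}} \exp\left(-\sqrt{2}\,|u|\right)
\quad\Longrightarrow\quad
\log f(u) = -\sqrt{2}\,|u| - \tfrac{1}{2}\log 2 .

% Choosing c_1 = -1/\sqrt{2}, c_2 = 0 and c_3 = -\tfrac{1}{2\sqrt{2}}\log 2 in (7) gives
G_{opt}(u) = |u| .

% Standard Gaussian density, for comparison:
f(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\tfrac{1}{2}u^2\right)
\quad\Longrightarrow\quad
\log f(u) = -\tfrac{1}{2}u^2 - \tfrac{1}{2}\log(2\pi) .
```

For the Laplacian case the optimal contrast function grows only linearly, far more slowly than the fourth power used by kurtosis. For the Gaussian case $\log f$ is purely quadratic and can be cancelled by the free term $c_2 u^2$, which is consistent with the requirement that G be non-quadratic.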


4.  ROBUSTNESS 

Another very desirable property of an estimator is robustness against
outliers [12]. This means that single, highly erroneous observations do not
have much influence on the estimator.

In this paper, we shall treat the question: How does the robustness of the
estimator w depend on the choice of the function G? Note that the robustness
of w depends also on the method of estimation used in constraining the
variance of $w^T x$ to equal unity in (3). This is a problem independent of
the choice of G. In the following, we assume that this constraint is
implemented in a robust way. In particular, we assume that the data is
sphered (whitened) in a robust manner, in which case the constraint reduces
to ||w|| = 1. Several robust estimators of the variance of $w^T x$ or of the
covariance matrix of x are presented in the literature; see [12].

The robustness of the estimator w in (3) can be analyzed using the theory
of M-estimators [12]. Without going into technical details, the definition of
an M-estimator can be formulated as follows: an estimator is called an
M-estimator if it is defined as the solution $\hat\theta$ for $\theta$ of

$$E\{\psi(x, \theta)\} = 0 \qquad (8)$$

where x is a random vector and $\psi$ is some function defining the estimator.
The estimator w in (3) is an M-estimator. To see this, define $\theta = (w, \lambda)$,
where $\lambda$ is the Lagrangian multiplier associated with the constraint. Using
the Kuhn-Tucker conditions, the estimator w can then be formulated as the
solution of equation (8) where $\psi = \psi_J$ is defined as follows (for sphered data):

$$\psi_J(x, \theta) = \begin{pmatrix} x\, g(w^T x) + c \lambda w \\ \|w\|^2 - 1 \end{pmatrix} \qquad (9)$$

where $c = (E_x\{G(w^T x)\} - E_\nu\{G(\nu)\})^{-1}$ is an irrelevant constant.

The analysis of robustness of an M-estimator is based on the concept of
an influence function, $IF(x, \hat\theta)$. Intuitively speaking, the influence function
measures the influence of single observations on the estimator. It would be
desirable to have an influence function that is bounded as a function of x, as
this implies that even the influence of a far-away outlier is 'bounded', and
cannot change the estimate too much. This requirement leads to one definition
of robustness, which is called B-robustness. An estimator is called B-robust
if its influence function is bounded as a function of x, i.e., $\sup_x \|IF(x, \hat\theta)\|$ is
finite for every $\hat\theta$. Even if the influence function is not bounded, it should
grow as slowly as possible when $\|x\|$ grows, to reduce the distorting effect of
outliers.

It can be shown [12] that the influence function of an M-estimator equals

$$IF(x, \hat\theta) = B\, \psi(x, \hat\theta) \qquad (10)$$

where B is an irrelevant invertible matrix that does not depend on x. On
the other hand, using our definition of $\psi_J$, and denoting by $\gamma = w^T x / \|x\|$
the cosine of the angle between x and w, one obtains easily

$$\|\psi_J(x, (w, \lambda))\|^2 = C_1 \frac{1}{\gamma^2} h^2(w^T x) + C_2 h(w^T x) + C_3 \qquad (11)$$

where $C_1, C_2, C_3$ are constants that do not depend on x, and $h(u) = u g(u)$.
Thus one sees that the robustness of w essentially depends on the behavior
of the function h(u). The slower h(u) grows, the more robust the estimator.
However, the estimator cannot be really B-robust, because the $\gamma$ in the
denominator prevents the influence function from being bounded for all x. In
particular, outliers that are almost orthogonal to w, and have large norms,
may still have a large influence on the estimator. These results are stated in
the following theorem:

Theorem 3 Assume that the data x is whitened (sphered) in a robust manner.
Then the influence function of the estimator w is never bounded for all
x. However, if $h(u) = u g(u)$ is bounded, the influence function is bounded in
sets of the form $\{x \mid w^T x / \|x\| > \epsilon\}$ for every $\epsilon > 0$, where g is the derivative
of G.

In particular, if one chooses a contrast function G(u) that is bounded, h is
also bounded, and w is quite robust against outliers. If this is not possible,
one should at least choose a contrast function G(u) that does not grow very
fast when |u| grows. If, in contrast, G(u) grows very fast when |u| grows,
the estimates depend mostly on a few observations far from the origin. This
leads to highly non-robust estimators, which can be completely ruined by
just a couple of bad outliers. This is the case, for example, when kurtosis is
used as a contrast function, which is equivalent to using w with $G(u) = u^4$.

5.  CHOOSING  THE  CONTRAST  FUNCTION  IN  PRACTICE 

It is useful to analyze the implications of the theoretical results of the
preceding sections by considering the following family of density functions:

$$f_\alpha(x) = C_1 \exp(C_2 |x|^\alpha) \qquad (12)$$

where $\alpha$ is a positive constant, and $C_1, C_2$ are normalization constants that
ensure that $f_\alpha$ is a probability density of unit variance. For different values of
$\alpha$, the densities in this family exhibit different shapes. For $0.5 < \alpha < 2$, one
obtains a sparse, super-Gaussian density (i.e. a density of positive kurtosis).
For $\alpha = 2$, one obtains the Gaussian distribution, and for $\alpha > 2$, a
sub-Gaussian density (i.e. a density of negative kurtosis). Thus the densities in
this family can be used as examples of different non-Gaussian densities.
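As a numerical illustration of the family (12), the following sketch computes the
normalization constants and the kurtosis in closed form, using the standard
gamma-function expressions for the generalized Gaussian density. The function
names are illustrative and do not come from the original text; note that $C_2$ comes
out negative, as it must for the density to be integrable.

```python
import math

def density_constants(alpha):
    """Return (C1, C2) such that f(x) = C1 * exp(C2 * |x|**alpha) has unit variance."""
    # Scale beta of the generalized Gaussian exp(-(|x|/beta)**alpha) giving variance 1.
    beta = math.sqrt(math.gamma(1.0 / alpha) / math.gamma(3.0 / alpha))
    c1 = alpha / (2.0 * beta * math.gamma(1.0 / alpha))
    c2 = -beta ** (-alpha)
    return c1, c2

def kurtosis(alpha):
    """Kurtosis E{x^4} - 3 of the unit-variance density f_alpha."""
    g1, g3, g5 = (math.gamma(k / alpha) for k in (1, 3, 5))
    return g5 * g1 / g3 ** 2 - 3.0

for alpha in (1.0, 2.0, 4.0):
    c1, c2 = density_constants(alpha)
    print(f"alpha={alpha}: C1={c1:.4f}, C2={c2:.4f}, kurtosis={kurtosis(alpha):+.3f}")
```

For $\alpha = 1$ this gives the double exponential density with positive kurtosis, for
$\alpha = 2$ the standard Gaussian with zero kurtosis, and for $\alpha = 4$ a negative
kurtosis, in line with the classification above.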

Using Theorem 2, one sees that in terms of asymptotic variance, an optimal
contrast function for estimating an IC whose density function equals $f_\alpha$
is of the form:

$$G_{opt}(u) = |u|^\alpha \qquad (13)$$

where the arbitrary constants have been dropped for simplicity. This implies
roughly that for super-Gaussian (resp. sub-Gaussian) densities, the
optimal contrast function is a function that grows slower than quadratically
(resp. faster than quadratically). Next, recall from Section 4 that if G(u)
grows fast with |u|, the estimator becomes highly non-robust against outliers.
Taking also into account the fact that most ICs encountered in practice
are super-Gaussian, one reaches the conclusion that as a general-purpose
contrast function, one should choose a function G that resembles rather

$$G_{opt}(u) = |u|^\alpha, \quad \text{where } \alpha < 2. \qquad (14)$$
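The step from Theorem 2 to the power-function form can be written out as
follows. This is only a worked restatement under the definitions above; the one
added observation is that $C_2 < 0$, which is needed for $f_\alpha$ to be integrable, so the
constant $c_1$ chosen below is negative.

```latex
\begin{align*}
G_{opt}(u) &= c_1 \log f_\alpha(u) + c_2 u^2 + c_3 \\
           &= c_1 \log C_1 + c_1 C_2 |u|^\alpha + c_2 u^2 + c_3 .
\end{align*}
% Choosing c_1 = 1/C_2, c_2 = 0 and c_3 = -c_1 \log C_1 leaves
\[ G_{opt}(u) = |u|^\alpha . \]
```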

The problem with such contrast functions is, however, that they are not
differentiable at 0 for $\alpha \le 1$. Thus it is better to use approximating
differentiable functions that have the same kind of qualitative behavior. Considering
$\alpha = 1$, in which case one has a double exponential density, one could use
instead the function $G_1(u) = \log \cosh a_1 u$ where $a_1 \ge 1$ is a moderately large
constant. Note that the derivative of $G_1$ is then the familiar tanh function
(for $a_1 = 1$). In the case of $\alpha < 1$, i.e. highly super-Gaussian ICs, one could
approximate the behavior of $G_{opt}$ for large u using a Gaussian function (with
a minus sign): $G_2(u) = -\exp(-a_2 u^2 / 2)$ where $a_2$ is a constant. The derivative
of this function is like a sigmoid for small values, but goes to 0 for larger
values. Note that this function also fulfills the condition in Theorem 3, thus
providing an estimator that is as robust as possible in this framework. We
have found $a_1 = 2$ and $a_2 = 1$ to provide 'good' approximations of $G_1$ and
$G_2$. Note that there is a trade-off between the precision of the approximation
and the smoothness of the resulting objective function.

Thus, we reach the following general conclusion:

• a good general-purpose contrast function is $G(u) = \log \cosh a_1 u$, where
$a_1 \ge 1$ is a constant.

• when the ICs are highly super-Gaussian, or when robustness is very
important, $G(u) = -\exp(-a_2 u^2 / 2)$ with $a_2 \approx 1$ may be better.

• using kurtosis is justified only if the ICs are sub-Gaussian and there are
no outliers.
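To make the robustness ranking behind these recommendations concrete, the
following sketch evaluates $h(u) = u g(u)$, the quantity that Theorem 3 requires to
be bounded, for the three contrast functions. It assumes $G(u) = u^4$ for kurtosis
and the constants $a_1 = 2$, $a_2 = 1$ suggested above; the function names are
illustrative only.

```python
import numpy as np

def h_kurtosis(u):
    # G(u) = u**4, so g(u) = 4*u**3 and h(u) = u*g(u) = 4*u**4
    return 4.0 * u ** 4

def h_logcosh(u, a1=2.0):
    # G1(u) = log cosh(a1*u), so g(u) = a1*tanh(a1*u)
    return u * a1 * np.tanh(a1 * u)

def h_gauss(u, a2=1.0):
    # G2(u) = -exp(-a2*u**2/2), so g(u) = a2*u*exp(-a2*u**2/2)
    return a2 * u ** 2 * np.exp(-a2 * u ** 2 / 2.0)

u = np.linspace(0.0, 10.0, 1001)
for name, h in [("kurtosis", h_kurtosis), ("log cosh", h_logcosh), ("Gaussian", h_gauss)]:
    print(f"{name:9s} max |h(u)| on [0, 10] = {np.abs(h(u)).max():.3g}")
```

Only the Gaussian contrast gives a bounded h; for log cosh h grows linearly in |u|,
and for kurtosis it grows like the fourth power.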

In this paper, we have used purely statistical criteria for choosing the
contrast function. One important criterion that is completely independent of
statistical considerations is computational simplicity. For example, the
calculation of the tanh function is rather slow in many environments. The
convergence may be sped up if one uses instead piecewise linear approximations
of the derivatives of the contrast functions. In the case of $g(u) = \tanh(a_1 u)$,
one may define g so that $g(u) = a_3 u$ for $|u| < 1/a_3$ and $g(u) = \mathrm{sign}(u)$
otherwise, where $a_3 \ge 1$ is a constant. This amounts to using the so-called Huber
function [12] as G.
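The piecewise linear replacement for tanh described above can be sketched in a
few lines. The function name and the test values are illustrative; the clipping
form is simply an equivalent way of writing the two-branch definition.

```python
import numpy as np

def g_piecewise(u, a3=1.0):
    """g(u) = a3*u for |u| < 1/a3 and sign(u) otherwise."""
    return np.clip(a3 * np.asarray(u, dtype=float), -1.0, 1.0)

u = np.array([-3.0, -0.25, 0.0, 0.25, 3.0])
print(g_piecewise(u, a3=2.0))   # piecewise linear surrogate
print(np.tanh(2.0 * u))         # the tanh nonlinearity it stands in for
```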


6.  SIMULATIONS 

We performed simulations in which 3 different contrast functions were used
to estimate one IC from a mixture of 4 i.i.d. ICs. The contrast functions
used were kurtosis, and the two functions proposed in the preceding section:
log cosh (or $G_1$) and the Gaussian function (or $G_2$). The constants were set as
suggested in the preceding section. We also used three different distributions
of the ICs: uniform, double exponential (or Laplace), and the distribution of
the third power of a Gaussian variable. The sample size was fixed at 1000 and
the fixed-point algorithm in [4] was used to maximize the contrast function.
The asymptotic mean absolute deviations (MAD) between the components of
the obtained vectors and the correct solutions were estimated and averaged
over 1000 runs for each combination of non-linearity and distribution of IC.
MAD was used instead of variance because it is a more robust measure of
deviation.
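The three source distributions and the error measure can be sketched as follows.
This is not the simulation code used for the figures: the unit-variance scaling,
the random seed and the function names are assumptions made for illustration,
and the fixed-point algorithm of [4] is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sources(kind, n, rng):
    """Draw n samples of a zero-mean, unit-variance source of the given kind."""
    if kind == "uniform":
        return rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), n)
    if kind == "laplace":
        return rng.laplace(0.0, 1.0 / np.sqrt(2.0), n)
    if kind == "cubed_gaussian":
        return rng.standard_normal(n) ** 3 / np.sqrt(15.0)   # E{z^6} = 15
    raise ValueError(kind)

def mean_abs_deviation(estimates, truth):
    """Mean absolute deviation between estimated and correct vector components."""
    return np.mean(np.abs(np.asarray(estimates) - np.asarray(truth)))

def sample_kurtosis(s):
    s = (s - s.mean()) / s.std()
    return np.mean(s ** 4) - 3.0

for kind in ("uniform", "laplace", "cubed_gaussian"):
    s = sample_sources(kind, 100_000, rng)
    print(kind, "variance", round(s.var(), 2), "kurtosis", round(sample_kurtosis(s), 2))
```

The sample kurtosis is negative for the uniform source, moderately positive for
the double exponential source, and strongly positive for the cubed Gaussian.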

The results in the basic, noiseless case are depicted in Fig. 1. As one can
see, the estimates using kurtosis were essentially worse for super-Gaussian
ICs. In particular, the strongly super-Gaussian IC (cube of Gaussian) was
estimated considerably worse using kurtosis. Only for the sub-Gaussian IC
was kurtosis better than the other contrast functions. There was no clear
difference between the performances of the contrast functions $G_1$ and $G_2$.

Next, the experiments were repeated with added Gaussian noise whose
energy was 10% of the energy of the ICs. The results are shown in Fig. 2. This
time, kurtosis did not perform better even in the case of the sub-Gaussian
density. This result goes against the view that kurtosis would tolerate Gaussian
noise well. Indeed, the theoretical arguments supporting that view neglect
any finite-sample effects, and may thus have rather limited validity.

No outliers were added in these experiments. Experiments confirming the
robustness of the non-linearities proposed in Section 5 can be found in [4].

7.  CONCLUSION 

The  problem  of  choosing  the  contrast  function  for  ICA  was  treated.  The 
behavior  of  a  large  family  of  contrast  functions,  which  includes  kurtosis  as 
a  special  case,  was  analyzed.  Combining  the  results  on  asymptotic  variance 
and  robustness  against  outliers,  it  was  shown  that  the  use  of  kurtosis  is  not 
justified  on  statistical  grounds,  except  perhaps  for  sub-Gaussian  independent 
components.  Instead,  contrast  functions  that  grow  slower  than  quadratically 
were  found  to  be  better  approximations  of  the  optimal  ones  in  most  cases. 
In neural learning rules, this leads, e.g., to the use of tanh-like sigmoids, or
functions resembling the derivative of a Gaussian function.

Figure 1: Estimation errors plotted for different contrast functions and
distributions of the ICs, in the noiseless case. Asterisk: uniform distribution.
Plus sign: double exponential. Circle: cube of Gaussian.

Figure 2: The noisy case. Estimation errors plotted for different contrast
functions and distributions of the ICs. Asterisk: uniform distribution. Plus
sign: double exponential. Circle: cube of Gaussian.

Abstract

Local independent component analysis is formulated as a
task involving the extraction of local geometric structure in
the joint distribution. Because the geometrical structure of
statistical independence is not well captured by statistical
descriptions such as moments and cumulants, we use feature
detection tools from image analysis to locate the local
independent component coordinate system. The resulting
approach to source separation can be implemented in real time
using conventional image analysis hardware. The generality
of this approach is demonstrated by blind source separation
of multi-modal sources, and the pseudo-separation of three
sources from two mixtures.


1  INTRODUCTION 

The blind source separation or independent component analysis (ICA)
algorithms of Bell and Sejnowski [2], Pearlmutter and Parra [8], Amari, Cichocki
and Yang [1], and Cardoso and Laheld [5] all attempt to find a global coordinate
system where the joint distribution takes on a product form. These linear ICA
algorithms are all non-local and linear in the sense that non-local information
is used and hence only linear mixtures can be separated. They involve
stochastic gradient descent of the density estimation parameters on a cost
function such as the Kullback-Leibler divergence, and only work when given
adequate priors on the joint distribution’s parametric form. Recently, the
authors directed attention to the intrinsic local structure of source mixtures,
and introduced a local aligned equipartition approach [6, 7]. The resulting
algorithm performs non-parametric density estimation and blind source
separation of non-linear mixtures. In this contribution, we further pursue the
local geometric feature approach, but instead use well-studied tools from the
field of image analysis for local feature extraction. In contrast to density
estimation source separation algorithms, we use “edge features” in the density
distribution to locate the independent component coordinate system. We
demonstrate two-dimensional source separation for ease of visualization and
computation, though the approach is clearly not limited in dimensionality.


2  LOCAL  GEOMETRICAL  STRUCTURE 

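In the blind separation problem, we are given input consisting of a
non-singular mixture of statistically independent sources, $|x\rangle = A|s\rangle$, where the
joint distribution in the source frame factorizes: $P_s(s) = \prod_k p_k(s_k)$. First we
rewrite the mixing equation in a more suggestive form with the jth column
of A denoted by the vector $|a_j\rangle$:

$$\begin{pmatrix} x_1 \\ x_2 \\ \vdots \end{pmatrix} = \big(\, |a_1\rangle \;\; |a_2\rangle \;\; \cdots \;\; |a_N\rangle \,\big) \begin{pmatrix} s_1 \\ s_2 \\ \vdots \end{pmatrix} \qquad (1)$$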


The column vectors of A define an independent component coordinate
system. The independent component basis vectors are in general not normalized
and orthogonal. From the mixing equation, it can be seen that the
components of the input in the independent component coordinate system are the
independent source amplitudes. As an example, for sharply peaked sources,
when only one source $s_i$ deviates significantly from its mean, the mixed
signal will fall predominantly along a line parallel to $|a_i\rangle$. More generally, high
density regions or directions are mapped to other high density regions or
directions. In the source frame, the high/low density regions are determined by
the location of the extrema of the individual source distributions: $p_k'(s_k) = 0$.
Consequently the extremal density directions are parallel to the source
directions. Locating the extremal density directions in the mixture frame thus
allows for source extraction from the mixture.


3  SOURCE  SEPARATION  ALGORITHM 

The  local  feature  detection  source  separation  algorithm  operates  in  batch 
mode,  and  consists  of  the  following  steps: 

a.  Obtain  histogram  of  the  joint  distribution 

The  input  vector  space  is  partitioned  into  bins.  A  straightforward  counting 
of  the  input  data  points  in  each  bin  gives  us  the  histogram  of  the  distribution. 

b. Determine the gradient of the distribution

We convolve the binned data set with a discrete derivative operator of
support L to approximate the local gradients of the joint probability distribution.
In practice, we have found that L = 7 for our numerics is a good compromise
between accuracy of the gradient estimate and appropriate sensitivity
to curvature in the data set.


c.  Threshold  the  gradient 

We first normalize to the maximum amplitude of the gradient. Then we
consider only the regions where the magnitude of the normalized gradient is
larger than a certain threshold. These regions are what we term “edges” of
the distribution. An extremal density region in the joint distribution will be
surrounded by two regions of high gradient with roughly opposite
orientations. Thus there are two parallel lines surrounding each extremal direction.
Numerically, we found that the directions of these lines could be determined
more accurately by considering them separately. This was accomplished by
partitioning the large gradient regions according to whether the gradient
orientation points in the upper or lower half plane.


Figure  1:  Left:  input  distribution  in  the  mixture  coordinate  frame.  Right: 
density  histogram  mesh  plot. 


d.  Hough  transform  the  edge  image 

The resulting binary images of the edges of the distribution are Hough
transformed for the detection of high gradient lines (see e.g. Jähne [4]). The peaks
in the Hough transform, which correspond to the “most popular lines”, are
located. By inserting the two most popular line orientations into an unmixing
matrix, an unmixing transformation which separates the mixture can be
obtained.
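Steps a to d can be sketched in NumPy as follows. This is a minimal illustration
under stated assumptions rather than the implementation used for the results
below: `np.gradient` (central differences) stands in for the derivative operator of
support L = 7, the Hough transform is a plain accumulator over integer bin
coordinates, and all function names are hypothetical. Locating several distinct
peaks in the accumulator (for example with non-maximum suppression) is left out.

```python
import numpy as np

def edge_images(x, bins=80, threshold=0.05):
    """Steps a-c: histogram, gradient, threshold; split by gradient orientation."""
    r = np.abs(x).max()
    # Same range on both axes so that bin indices preserve line orientations.
    hist, _, _ = np.histogram2d(x[0], x[1], bins=bins, range=[[-r, r], [-r, r]])
    g0, g1 = np.gradient(hist)              # derivatives along the x1 and x2 bin axes
    mag = np.hypot(g0, g1)
    strong = mag > threshold * mag.max()    # threshold relative to the maximum amplitude
    return strong & (g1 >= 0), strong & (g1 < 0)   # upper / lower half plane

def hough_transform(edge_img, n_theta=180):
    """Step d: accumulate votes for lines rho = i*cos(theta) + j*sin(theta)."""
    ii, jj = np.nonzero(edge_img)
    thetas = np.deg2rad(np.arange(n_theta))          # angle of the line normal
    diag = int(np.ceil(np.hypot(*edge_img.shape)))
    acc = np.zeros((n_theta, 2 * diag + 1), dtype=int)
    for k, th in enumerate(thetas):
        rho = np.round(ii * np.cos(th) + jj * np.sin(th)).astype(int) + diag
        acc[k] = np.bincount(rho, minlength=2 * diag + 1)
    return acc, thetas

def unmixing_matrix(orientations):
    """Invert the matrix whose columns point along the detected line orientations."""
    basis = np.vstack([np.cos(orientations), np.sin(orientations)])
    return np.linalg.inv(basis)

# Synthetic edge image containing the single line i == j (orientation 45 degrees).
img = np.eye(80, dtype=bool)
acc, thetas = hough_transform(img)
k, _ = np.unravel_index(acc.argmax(), acc.shape)
line_angle = (np.rad2deg(thetas[k]) + 90.0) % 180.0   # normal angle -> line orientation
print(line_angle)

# With the true column orientations of a (hypothetical) mixing matrix, W @ A is diagonal.
A = np.array([[1.0, 0.5], [0.3, 1.0]])
W = unmixing_matrix(np.arctan2(A[1], A[0]))
print(W @ A)
```

The accumulator is indexed by the angle of the line normal, so the line
orientation is obtained by adding 90 degrees, which matches the labeling
convention used in the caption of Figure 2.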


4  SIMULATION  RESULTS 


4.1  Separation  of  unimodal  sources 

Two audio files consisting of spoken Japanese phrases were linearly mixed by
a random mixing matrix. The input distribution consisting of 34934 points
was histogrammed into 80 × 80 bins, as shown in Figure 1. The plot clearly
shows the geometrical structure discussed in Section 2. The normalized
gradient of the histogram is computed and thresholded at five percent of the
maximum value, giving us the “edges” of the joint distribution. Figure 2
shows the “edge” regions corresponding to large gradient regions pointing in
the lower half plane. The Hough transforms of the binary images are shown
alongside. An unmixing matrix was constructed from the two most popular
line orientations. Left multiplying the mixing matrix with the unmixing
matrix, we find the resulting mixture-to-signal ratio of the two outputs to be
0.95 and 3.8 percent respectively.
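The text does not spell out how the mixture-to-signal ratio is computed. One
plausible reading, used here purely as an assumption, is the ratio of the residual
amplitude to the dominant amplitude in each row of the product of the unmixing
and mixing matrices. The matrices below are hypothetical.

```python
import numpy as np

def mixture_to_signal(W, A):
    """Per-output ratio of residual to dominant source amplitude in W @ A."""
    P = np.abs(W @ A)
    dominant = P.max(axis=1)
    return (P.sum(axis=1) - dominant) / dominant

A = np.array([[1.0, 0.5], [0.3, 1.0]])                        # hypothetical mixing matrix
W = np.linalg.inv(A + np.array([[0.0, 0.01], [0.0, 0.0]]))    # slightly inaccurate unmixing
print(mixture_to_signal(W, A))
```

A perfect unmixing matrix gives a ratio of zero for every output, and small
orientation errors show up as ratios of a few percent or less.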


Figure 2: Left: high gradient “edge” regions pointing in the lower half plane.
Right: corresponding Hough transform. As labeled here, the angle $\theta$
corresponds to the lines’ actual orientations instead of the orientation of the
normal to the lines. The grey-scale bar on the far right indicates the number
of votes for each line in the Hough transform.


4.2  Separation  of  bimodal  sources 

Artificial bimodal sources were constructed from the same audio files used
in the previous section. The mass was intentionally not evenly distributed
between the two peaks in each source. The mass ratio between the peaks
was 5/3 in one source and 5/2 in the other. Since gradients are used, the
amplitudes of the peaks are not as essential as they are in density estimation
source separation approaches. The input distribution, large gradient regions,
and its Hough transform are shown in Figure 3. Again using the two most
popular line orientations from the Hough transform, the separation achieved
a reduction of the mixture to 1 and 1.6 percent of the sources respectively for
the two outputs.

Figure 3: Separation of bimodal sources. Left: input distribution. Right:
Hough transform with four peaks corresponding to the four lines. The figure
axis labels are the same as in the previous figures.

4.3  Separation  of  sources  from  fewer  mixtures 

Not only does this approach work for multi-modal sources, but the “non-square”
blind separation problem, where the number of mixtures is less than the
number of sources, can also be tackled. Extra sources will just contribute extra
density extrema directions. The audio sources were pre-processed to make
them even more sharply peaked to ease the feature extraction process.
Simulation results are shown in Figure 4. The three peaks in the Hough transform
now correspond to the three sources in the mixture. The three most popular
line orientation peaks found in the Hough transform deviate by only
0.009, 0.01 and 0.002 radians from the actual orientations as determined by
the mixing matrix. However, this information is not sufficient to construct
the two-to-three dimensional unmixing map. The problem is not well posed
since the mixing transformation is not 1-to-1. Nevertheless, a simple “one
channel” representation of the three sources can be obtained by partitioning
the input space in accordance with the three high density source orientations
[7]. A given input x is attributed to the source with the closest corresponding
orientation, with the value of the output taken to be the dot product of the
input vector x with a vector of unit length along the source orientation. In
this unmixing scheme, only one output is non-zero at any given time, hence
the overlap between any two outputs is zero. Despite the severity of this
constraint, the fidelity of the resulting one-channel source approximations is
remarkably good, as seen in Figure 5.
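The one-channel partition described above can be sketched as follows. The
function name and the example orientations are hypothetical; "closest
orientation" is implemented as the largest absolute cosine between the input
and the unit vector along each source line.

```python
import numpy as np

def one_channel_outputs(x, orientations):
    """Assign each input column of x to the closest source orientation.

    x: array (2, T) of mixtures; orientations: angles (radians) of the K source lines.
    Returns an array (K, T) with at most one non-zero entry per column.
    """
    units = np.vstack([np.cos(orientations), np.sin(orientations)])   # (2, K) unit vectors
    proj = units.T @ x                                                 # (K, T) dot products
    winner = np.abs(proj).argmax(axis=0)       # closest line = largest |cosine|
    out = np.zeros_like(proj)
    cols = np.arange(x.shape[1])
    out[winner, cols] = proj[winner, cols]     # only the winning output is non-zero
    return out

ang = np.deg2rad([0.0, 60.0, 120.0])           # three hypothetical source orientations
U = np.vstack([np.cos(ang), np.sin(ang)])
x = U * np.array([2.0, -3.0, 1.0])             # one sample lying along each orientation
print(one_channel_outputs(x, ang))
```

Each sample contributes to exactly one output, which is why the overlap
between any two outputs vanishes.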


Figure  4:  Pseudo-separation  of  three  sharply  peaked  sources.  Left:  input 
distribution.  Right:  Hough  transform  with  three  peaks  corresponding  to  the 
three  sources. 

5  CONCLUSION 

The algorithm presented in this paper applies specifically to the separation
of two-dimensional data sets. However, this approach is clearly more general
than the specific separation examples presented. The algorithm is applicable
to source distributions which contain large derivatives, such as uniform
sources. While the Hough transform used is optimized for detecting lines in
two dimensions, other versions exist which are sensitive to more general
features at a range of spatial scales. It is also clear that the approach introduced
here is not limited to two-dimensional or linearly mixed signals.

The success of this edge-detection algorithm suggests many possible
variations. For example, if the sources are known to be unimodal and sharply
peaked, an even simpler algorithm can be used. By normalizing all the input
vectors to unit length, the task of finding the N independent component basis
vectors becomes that of finding N clusters on the N-sphere. The algorithm
presented in this paper works for a larger class of sources because it extracts
information from all high gradient regions in the joint distribution. From
a mathematical standpoint, we are essentially looking at the iso-probability
lines in large gradient areas, with the understanding that they are oriented
preferentially along one of the independent component basis directions. A
generalization of the algorithm consists of cluster analysis of the local
gradient directions.

Figure 5: The three sources (s1, s2, s3), two mixtures (x1, x2) and the three
outputs (o1, o2, o3) obtained by partitioning the input space into three
regions. The outputs were permuted and arbitrarily multiplied by a scaling
factor to match the sources. Even though only one output deviates from zero
at any given time, the source approximation is surprisingly good. This is seen
visually above, and can be verified by listening to the outputs.


The simplicity of this approach should lead to state-of-the-art algorithms in
terms of both speed and generality. Not only is the local feature approach
more robust to noise than density estimation approaches, but its performance
also degrades gracefully when extra sources are introduced. And finally, from a
neural modeling perspective, this approach complements the current work on
visual cortex modeling. Bell and Sejnowski [3] found that local edge filters
resulted when natural images were fed into their source separation algorithm.
In this paper, we show that edge-detectors can perform source separation,
and hence the same neural architecture that codes and processes images can
also extract independent sources.

Abstract

We present a new set of learning rules for the nonlinear
blind source separation problem based on the information
maximization criterion. The mixing model is
divided into a linear mixing part and a nonlinear transfer
channel. The proposed model focuses on a parametric
sigmoidal nonlinearity and higher order polynomials.
Our simulation results verify the convergence of the
proposed algorithms.

1  INTRODUCTION 

In blind source separation or independent component analysis (ICA) the
problem is how to recover independent sources given the sensor outputs in
which the sources have been mixed in an unknown channel. The problem
has become increasingly important in the signal processing area due to its
prospective application in speech recognition, telecommunications and
medical signal processing. The linear blind source separation problem has been
studied by researchers in the field of neural networks [1, 2, 5, 9] and
statistical signal processing [4, 6]. Potential application in automatic speech
recognition systems has been considered in [10] where two speech signals
recorded in a real environment have been separated. Furthermore, Makeig
et al. [12] have studied independent components of electroencephalographic
(EEG) data. There are several other potential applications in the signal
processing area which may benefit from ICA as a preprocessing analysis.
Nevertheless, the linear mixing model may not be appropriate for some real
environment experiments. Therefore, researchers have recently started
extending the ICA problem to nonlinear mixing models [3, 8, 11, 13, 14, 15].
In [8, 11, 13] the nonlinear components are extracted using self-organizing
feature maps (SOFM). However, due to the limited number of neurons that
map the underlying distribution, the derived components have a quantization
error that increases with increasing distance to the neighboring neurons.

In  this  paper,  we  propose  a  set  of  algorithms  for  the  nonlinear  mixing 
problem  using  parametric  nonlinear  functions.  In  particular,  we  assume  that 
the  mixing  is  performed  in  two  stages:  a  linear  mixing  followed  by  a  nonlinear 
transfer  function.  We  focus  on  a  parametric  sigmoidal  nonlinearity  and  on 
higher  order  polynomials.  A  similar  approach  has  been  independently  studied 
by  Taleb  and  Jutten  [14].  They  approximate  the  inverse  transfer  function  by 
multilayer  perceptrons  (MLP)  that  are  trained  in  an  unsupervised  manner. 
This  kind  of  model  may  be  justified  for  several  biomedical  signal  analysis 
problems  such  as  brain  blood  moving  analysis  in  magnetic  resonance  imaging 
(MRI)  and  EEG  analysis.  It  may  also  be  used  to  account  for  microphone 
nonlinearities  in  speech  recording  experiments.  For  these  problems  this  model 
may  be  an  appropriate  representation  of  the  actual  physical  phenomenon. 

We  present  the  nonlinear  model  and  derive  a  set  of  learning  rules  based 
on  the  information  maximization  criterion  [2].  The  learning  rules  are  verified 
via  simulation  and  future  research  is  discussed  at  the  end. 

2  NONLINEAR  MIXING  AND  UNMIXING  MODEL 

Figure 1 shows the mixing system, which is divided into a linear mixing part
and a nonlinear transfer part. Each channel i consists of an invertible
nonlinear transfer function $f_i(t_i)$. The unmixing system is the inverse sequence of
the mixing system. Figure 1 shows that we first invert the nonlinear transfer
function in each channel i with $g_i(x_i)$ and then unmix the linear mixing by
applying W to z. The sources s are recovered if $g_i(x_i)$ and W are the inverse
functions for $f_i(t_i)$ and A respectively.

Figure 1: Mixing and Unmixing Model (blocks, in order: linear mixing, nonlinear
transfer function, inverse transfer function, linear unmixing). The mixing stage
consists of a linear mixing matrix A and a nonlinear transfer function f(t). The
unmixing stage consists of the inverse operation: the equalization of the
nonlinear transfer function g(x) and the unmixing matrix W.

In our model we use the following signals: $s = [s_1, s_2, \ldots, s_N]^T$,
$t = [t_1, t_2, \ldots, t_N]^T$, $x = [x_1, x_2, \ldots, x_N]^T$,
$z = [z_1, z_2, \ldots, z_N]^T$, $u = [u_1, u_2, \ldots, u_N]^T$,
$f = [f_1(t_1), f_2(t_2), \ldots, f_N(t_N)]^T$, and
$g = [g_1(x_1), g_2(x_2), \ldots, g_N(x_N)]^T$. Furthermore, the signals are
related by the following equations:

$$t = A \cdot s \tag{1}$$

$$x = f(t) \tag{2}$$

$$z = g(x) \tag{3}$$

$$u = W \cdot z = W \cdot g[f(A \cdot s)] \tag{4}$$
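The chain of equations (1) to (4) can be sketched numerically. This is only a minimal illustration: the mixing matrix, the source distribution, and the choice of tanh/arctanh as the invertible pair f, g are assumptions made for the example and are not taken from the experiments in this paper.

```python
import numpy as np

def mix(s, A, f):
    """Mixing stage: t = A s (eq. 1), x = f(t) (eq. 2)."""
    t = A @ s
    return f(t)

def unmix(x, W, g):
    """Unmixing stage: z = g(x) (eq. 3), u = W z (eq. 4)."""
    z = g(x)
    return W @ z

rng = np.random.default_rng(0)
# Hypothetical super-Gaussian (Laplacian) sources, scaled to avoid
# driving tanh deep into saturation.
s = rng.laplace(scale=0.3, size=(2, 2000))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])            # hypothetical mixing matrix
f, g = np.tanh, np.arctanh            # invertible pair chosen for illustration

x = mix(s, A, f)
# With the exact inverses g = f^{-1} and W = A^{-1}, u reproduces s.
u = unmix(x, np.linalg.inv(A), g)
```

In the blind setting neither A nor f is known, so W and the parameters of g have to be learned; the sketch only shows that the model is exactly invertible when they are.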

3  THE  LEARNING  RULES 
3.1  Information  Maximization 

The separation of independent components from a mixed signal observation
x can be described by a general measure of independence between the pdf of
the random variable, $p_x(x)$, and the pdf of its components, $p_{x_i}(x_i)$.
The Kullback-Leibler divergence measures the degree of distance and is
defined by:

$$\delta\Big(p_x(x), \prod_{i=1}^{N} p_{x_i}(x_i)\Big) = \int p_x(x) \log \frac{p_x(x)}{\prod_{i=1}^{N} p_{x_i}(x_i)} \, dx \tag{5}$$

It vanishes if and only if $p_x(x)$ factorizes, which leads to
$\delta(p_x(x), \prod_{i=1}^{N} p_{x_i}(x_i)) = 0$. Writing
$\tilde{p}_x(x) = \prod_{i=1}^{N} p_{x_i}(x_i)$, the Kullback-Leibler
divergence has the form of the mutual information of x, and this can be
rewritten in terms of entropies as follows:

$$\delta(p_x(x), \tilde{p}_x(x)) = H(\tilde{p}_x(x)) - H(p_x(x) \mid \tilde{p}_x(x)) \tag{6}$$

Bell and Sejnowski [2] have proposed an information-theoretic approach where
they maximize the mutual information that an output y = h(x) of a neural
processor contains about its input x. They have shown that for invertible and
continuous deterministic mappings h(x), the mutual information between
inputs and outputs can be maximized by maximizing the entropy of the outputs
alone, where the output pdf satisfies:

$$p_y(y) = \frac{p_x(x)}{|J(x)|} \tag{7}$$

with J(x) being the determinant of the Jacobian of the neural transfer
function h(x). The entropy of the signal y is given by

$$H(y) = -E[\ln p_y(y)] = E[\ln |J|] - E[\ln p_x(x)] \tag{8}$$

where E denotes the expected value. The maximization is done by maximizing
the first term with respect to the parameters of the unmixing functions.
That is, we have to learn the elements of the linear unmixing matrix W and
the set of parameters for the nonlinearities $g_i(x_i)$. Using a gradient
ascent algorithm we take the derivative of the entropy function with respect
to the $W_{ij}$ and the parameters of the nonlinearity. Therefore, we derive

$$\Delta W \propto \frac{\partial H(y)}{\partial W} = \frac{\partial}{\partial W} \ln |J| \tag{9}$$


The  second  term  in  equation  (8)  is  independent  of  all  model  parameters. 
Hence,  the  gradient  of  equation  (9)  is  as  follows: 


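$$\frac{\partial}{\partial W} \ln |J| = \frac{\partial}{\partial W} \ln |\det(W)| + \frac{\partial}{\partial W} \sum_i \ln \left| \frac{\partial y_i}{\partial u_i} \right| + \frac{\partial}{\partial W} \sum_i \ln \left| \frac{\partial g_i}{\partial x_i} \right| \tag{10}$$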


Considering  the  set  of  parameters  W,  a  better  way  to  maximize  entropy  in 
the  feedforward  and  feedback  system  is  not  to  follow  the  entropy  gradient, 
as  in  [2],  but  to  follow  its  ‘natural’  gradient,  as  reported  by  Amari  et  al  [1]: 

$$\Delta W \propto \frac{\partial H(y)}{\partial W} W^T W \tag{11}$$

This  is  an  optimal  rescaling  of  the  entropy  gradient.  It  simplifies  the  learning 
rule  and  speeds  convergence  considerably. 


3.2  Learning  Rules  for  Sigmoidal  Nonlinear  Mixing 

The  infomax  criterion  holds  for  our  model  since  independent  variables  can¬ 
not  become  dependent  by  passing  them  through  an  invertible  nonlinearity. 
Hence,  the  mutual  information  before  and  after  the  nonlinear  stage  is  not 
affected. 

For  the  derivation  of  the  learning  rule  for  the  Wij  we  do  not  need  to 
consider  the  last  term  of  equation  (10).  Therefore,  the  learning  rule  for  W 
is: 

$$\Delta W \propto (W^T)^{-1} + (1 - 2y)\, g^T(x) \tag{12}$$

Considering the Amari et al. extension from equation (11), it follows that:

$$\Delta W \propto W + (1 - 2y)\, u^T W \tag{13}$$
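The natural-gradient rule in equation (13) can be sketched as a single batch update. This is a minimal sketch under stated assumptions: the logistic output nonlinearity y = 1/(1 + exp(-u)) is assumed (it is the one for which the factor (1 - 2y) arises), the anti-Hebbian term is averaged over a batch, and the function name and learning rate are hypothetical.

```python
import numpy as np

def natural_gradient_step(W, z, lr=0.01):
    """One batch update W <- W + lr * (W + (1 - 2y) u^T W), cf. eq. (13).

    W : array (N, N), current unmixing matrix
    z : array (N, T), inputs to the linear unmixing stage (z = g(x))
    The logistic output y = 1 / (1 + exp(-u)) is an assumption of this sketch.
    """
    u = W @ z
    y = 1.0 / (1.0 + np.exp(-u))
    T = z.shape[1]
    # Average the anti-Hebbian term (1 - 2y) u^T over the T samples.
    delta = W + ((1.0 - 2.0 * y) @ u.T / T) @ W
    return W + lr * delta

# Sanity check: for zero input, u = 0 and y = 0.5, so only the W term remains.
W_new = natural_gradient_step(np.eye(2), np.zeros((2, 5)), lr=0.1)
```

Note that no matrix inversion is needed, in contrast to the ordinary gradient in equation (12).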

Although this learning rule is derived for super-Gaussian sources, we may
extend the rule to the separation of sub-Gaussian sources by including the
kurtosis in the second term, which turns the anti-Hebbian rule into a Hebbian
learning rule for sub-Gaussians. Girolami and Fyfe use this in the projection
pursuit network [7]. In order to derive the complete set of learning rules we
assume that the nonlinear mixing is accomplished by a sigmoidal transfer
function.

where  S  denotes  the  scaling  and  a  the  slope  of  the  transfer  function.  For  this 
case,  gi(xi)  provides  the  inverse  function  by 

$$g_i(x_i) = -2 r_i \operatorname{arctanh}(d_i x_i) \tag{15}$$

3.3 Learning Rules for Flexible Nonlinearities

A weakness of the sigmoidal nonlinearity is that the learning rules can be
successfully applied only to those problems which fit the parametric structure
of a sigmoid. However, in situations where a priori knowledge about the
mixing model is not available, we need to learn a more flexible nonlinear
transfer function. We assume that a nonlinearity may be approximated
by polynomials of n-th order. This nonlinear stage may be described as:

=  (23) 

k=l 

The inverse $g_j(x_j)$ of the function in equation (23) results in an
expression which is generally not easy to handle with respect to our
purposes. We therefore make the assumption that the inverse may be
approximated by P-th order polynomials. Hence, the inverse is:


k-1 


k=l 


(24) 
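The assumption that the inverse of an invertible polynomial is well approximated by a polynomial of higher order can be checked with a small numerical sketch. The polynomial f below and the use of a supervised least-squares fit are assumptions made purely for illustration; the paper learns the coefficients without supervision by gradient ascent on the entropy.

```python
import numpy as np

def f(t):
    # Hypothetical invertible 3rd-order polynomial: f'(t) = 1 + 0.3 t^2 > 0.
    return t + 0.1 * t**3

t = np.linspace(-1.0, 1.0, 401)
x = f(t)

def inverse_fit_error(order):
    """Max error of an order-th polynomial g fitted so that g(f(t)) ~ t."""
    coeffs = np.polyfit(x, t, order)   # least-squares fit of t as a function of x
    return np.max(np.abs(np.polyval(coeffs, x) - t))

err3 = inverse_fit_error(3)   # inverse of the same order as f
err7 = inverse_fit_error(7)   # inverse of higher order than f
```

In this example the higher-order fit reproduces the inverse noticeably better than a fit of the same order as f.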


In the same manner, we can perform a gradient ascent on the entropy function
to learn the parameters $g_{jk}$:

$$\Delta g_{jk} \propto \frac{\partial H}{\partial g_{jk}} \tag{25}$$


Performing  this  operation  on  equation  (8),  the  learning  rule  for  finding  gjk 
is  the  sum  of  the  following  two  terms: 


d 

dgjk 


In 


and 


d 

^9jk 


In 


d 


=  ^(1  - 


m=l 


(26) 


4  SIMULATION  RESULTS 
4.1  Sigmoidal  Nonlinearities 

To verify the validity of the model and the convergence of the learning rules,
we have performed several experiments with the architecture shown in Figure
1. Figure 2 shows the result of the mixing and unmixing system. Two
independent white noise sources with super-Gaussian distribution have been
generated artificially and are shown in a scatter plot in Figure 2 (a). The
sources have first been linearly mixed (b) and then nonlinearly mixed (c).

■  ti{z) 

t2{z) 
$$x_1 = f_1(0.5\, t_1), \qquad x_2 = f_2(t_2) \tag{28}$$


The unmixing yields the results in Figure 2 (d) and (e) when the
nonlinearities are initialized identically and the unmixing matrix W is chosen
randomly. The algorithm converges after presenting 500 samples, and the
unmixed signals are shown in Figure 2 (f). The SNRs for the observed mixed
signals $x_1, x_2$ are -Q.SdB and -Q.ldB respectively. For the unmixed signals
$u_1, u_2$ the SNR is increased to S.9dB and S.OdB respectively.


Figure 2: Mixing and Unmixing Simulation. (a) independent sources (b) linear
mixed sources (c) nonlinear mixing (d) initially unmixed nonlinearity (e) initial
separated signals u (f) final separated signals u


4.2  Flexible  Nonlinearities 

As in section 4.1, we performed several experiments with the architecture
shown in Figure 1 to verify the learning rules for flexible nonlinearities. For
the linear and nonlinear mixing stage we use the same mixing matrix and
nonlinearities as in section 4.1. The independent sources are white noise
signals with sub-Gaussian distribution. A scatter plot of the sources is
depicted in Figure 3 (a). The unmixing matrix W and the coefficients $g_{jk}$
of the polynomials forming the nonlinearity are chosen randomly with Q = P.
Figure 3 (f) shows the results after presenting 1000 samples. We observe that
the order of the inverse nonlinearity g(x) has to be higher than the order of
the nonlinearity f(t). The stability of the polynomial nonlinearity is highly
dependent on the initial values of the coefficients, which may be chosen to
approximate an invertible nonlinearity.


Figure 3: Mixing and Unmixing Simulation Using Flexible Nonlinearities. (a)
independent sources (b) linear mixed sources (c) nonlinear mixing (d) initially
unmixed nonlinearity (e) initial separated signals u (f) final separated signals u

In Figure 4 (a) and (b) we show the time course of a sinusoid and of a
white noise signal with super-Gaussian distribution. The signals have been
mixed linearly and transformed by a nonlinear transfer function f(t), where
f(t) is an invertible 5th-order polynomial function. The inverse is
approximated by an 8th-order polynomial function g(x). The time courses of the
recovered signals are shown in Figure 4 (e) and (f).

5  CONCLUSIONS  AND  FUTURE  RESEARCH 

We have derived a set of learning rules for the nonlinear blind source
separation problem based on the information maximization criterion. The mixing
model is divided into a linear mixing part and a nonlinear transfer channel.
The proposed algorithms are focused on a parametric sigmoidal nonlinearity
and higher order polynomials. Simulations have been performed to
verify the learning rules.

We plan to apply the algorithms to biomedical data such as MRI. To this
end, we need to investigate further the stability and convergence criteria of
the proposed algorithms. In addition, the model can be extended to exhibit
a nonlinear cross channel mixing. Then, instead of N nonlinearities for N
channels we have to find nonlinearities. Cross channel nonlinearities have
been considered by Yang et al. in [15] and Burel in [3]. In their approach, the
nonlinear observation g(x) has been linearly mixed by a second mixing matrix
$W_2$. The learning rules can be derived to find $W_1$, $W_2$ and the
nonlinearity g(x). In contrast to their approach, the subject of our future
interest is to find nonlinear cross-channels which can be parameterized
independently from the channel transfer functions.

Figure 4: Mixing and Unmixing Simulation Using Flexible Nonlinearities.


Abstract

This  paper  provides  a  detailed  and  rigorous  analysis  of  the  two 
commonly used methods for blind source separation: Linear
Independent  Component  Analysis  (ICA)  and  Information 
Maximization  (InfoMax).  The  paper  shows  analytically  that  ICA 
based  on  the  Kullback-Leibler  information  as  a  mutual  information 
measure  and  InfoMax  lead  to  the  same  solution  if  the 
parameterization  of  the  output  nonlinear  functions  in  the  latter 
method  is  sufficiently  rich.  Furthermore,  this  work  discusses  the 
alternative  redundancy  measures  not  based  on  the  Kullback-Leibler 
information  distance  and  Nonlinear  ICA.  The  practical  issues  of 
applying  ICA  and  InfoMax  are  also  discussed. 


1.  INTRODUCTION 

The pioneering work of Zipf [1] and the ideas of Attneave [2] about information
processing in visual perception have led to the idea that the nervous system and
brain may be regulated by an economy principle. In the neural network community
these ideas were introduced by the important paper of Barlow [3]. In this work
the author presented the connectionist model of unsupervised learning under the
perspective of redundancy reduction. The minimum entropy coding method was
introduced for the generation of factorial codes [4]. Atick and Redlich [5]
demonstrated that statistically salient input features can be optimally
extracted from a noisy input by maximizing mutual information. Simultaneously,
Atick and Redlich [6] and especially the works of Redlich ([7], [8]) concentrate
on the original idea of feature extraction by redundancy reduction. Several
neural network learning algorithms for PCA are presented, among others, in [9]
and [10].

The problem of Linear Independent Component Analysis as linear feature
extraction, i.e. blind source separation, was introduced by Comon [11] and
further extended in the linear case and defined in the nonlinear case by the
works of the authors ([12]-[19]). In parallel, Bell and Sejnowski [20] have
demonstrated that their InfoMax method can also achieve linear feature
extraction. This paper provides a detailed and rigorous analysis of the two
methods and derives conditions under which these methods lead to an identical
solution. In addition, the paper briefly addresses the cumulant based criteria
for ICA as well as Nonlinear ICA.


2.  LINEAR  INDEPENDENT  COMPONENT  ANALYSIS  AND 
INFORMATION  MAXIMIZATION 

Let $\vec{x}$ be a random vector of dimension n with the joint probability
density function $p(\vec{x})$ whose covariance matrix is nonsingular.
Furthermore, let M be a linear square map which maps $\vec{x}$ into the random
vector $\vec{y}$ whose probability density function is $p(\vec{y})$.

Definition 1: ICA

Linear Independent Component Analysis (ICA) is an input/output linear
transformation M from $\vec{x}$ to $\vec{y}$ such that the output components
with joint probability:

$$p(\vec{y}) = p(y_1 \ldots y_n); \qquad \vec{y} = M \vec{x} \tag{1}$$

are "as independent as possible" according to the appropriate measure, i.e.
distance D:

$$p(\vec{y}) \approx \prod_i p(y_i) \tag{2}$$

In the special case where the complete independence of the output components is
achieved, the following holds:

$$p(y_1 \ldots y_n) = p(y_1) \cdots p(y_n) \tag{3}$$

If the input vector $\vec{x}$ is jointly Gaussian, ICA is equivalent to the
problem of diagonalizing the output covariance matrix, which is the standard
PCA problem. In order to guarantee the existence of the solution for the ICA
problem, we assume that the input signal $\vec{x}$ was originally obtained by
the invertible linear mixture of statistically independent signals.


Definition  2:  Information  maximization 

Let the above defined random vector $\vec{x}$ be transmitted through a
combination of a matrix M and n nonlinear functions $f_i$, $i = 1, \ldots, n$,
such that the resulting components of the output vector $\vec{w}$ are defined
as:

$$w_i = f_i(y_i); \qquad \vec{y} = M \vec{x} \tag{4}$$

Under the assumption that every nonlinear function $f_i$ is differentiable and
that its derivative $f_i'$ satisfies

$$\int_{-\infty}^{\infty} f_i'(y_i) \, dy_i = 1 \tag{5}$$

the information maximization problem is defined as maximization of the entropy

$$H(\vec{w}) = -\int d\vec{w} \; p(\vec{w}) \log(p(\vec{w})) \tag{6}$$

over the elements of matrix M and, possibly, the free parameters in the
parameterization of $f_i$. Typical choices for $f_i$ are single or neural
networks with normalized sums of sigmoidal functions.

At first glance the ICA and InfoMax problems seem to be substantially different.
Nevertheless, it is known that information maximization leads to the statistical
factorization of the output components $w_i$, i.e. that it essentially performs
the same task as ICA [20]. In the remaining part of the paper we give a rigorous
proof that these two problems are identical when the Kullback-Leibler
information is used as a measure of the statistical independence in ICA and when
the derivatives $f_i'$ are capable of approximating the output marginal
distributions with infinite precision.

The Kullback-Leibler distance between the joint and the marginal probabilities
is defined as:

$$K\Big\{p(\vec{y}), \prod_i p(y_i)\Big\} = \int d\vec{y} \; p(\vec{y}) \log\left(\frac{p(\vec{y})}{\prod_i p(y_i)}\right) \ge 0 \tag{7}$$

or equivalently:

$$K\Big\{p(\vec{y}), \prod_i p(y_i)\Big\} = \sum_{i=1}^{n} H(y_i) - H(\vec{y}) \tag{8}$$

Equation (8) indicates that the Kullback-Leibler distance is the mutual
information between the output components $y_i$.
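The equivalence of equations (7) and (8) can be verified numerically. The sketch below uses a small discrete joint distribution instead of continuous densities, which is an assumption made only to keep the example exact; the probability table and the function names are hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl_joint_vs_marginals(P):
    """K{p(y), prod_i p(y_i)} for a discrete 2-D joint distribution P, cf. eq. (7)."""
    p1 = P.sum(axis=1)
    p2 = P.sum(axis=0)
    Q = np.outer(p1, p2)               # product of the marginals
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

P = np.array([[0.4, 0.1],
              [0.1, 0.4]])             # hypothetical dependent components
kl = kl_joint_vs_marginals(P)
# Right-hand side of eq. (8): sum of marginal entropies minus joint entropy.
mi = entropy(P.sum(axis=1)) + entropy(P.sum(axis=0)) - entropy(P.ravel())

# For a factorizing joint distribution the distance vanishes.
kl_indep = kl_joint_vs_marginals(np.outer([0.3, 0.7], [0.5, 0.5]))
```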


The relationship between the input and output joint probabilities of a
differentiable map g is equal to:

$$p(\vec{\mathrm{out}}) = \frac{p(\vec{\mathrm{in}})}{|\det(J)|} \tag{9}$$

where J is the Jacobian matrix of g. Consequently, the relationship between the
corresponding entropies is:

$$H(p(\vec{\mathrm{out}})) = H(p(\vec{\mathrm{in}})) + \int d\vec{\mathrm{in}} \; p(\vec{\mathrm{in}}) \log(|\det(J)|) \tag{10}$$

Combining equations (7) and (8) with (9), it follows that:

$$K\Big\{p(\vec{y}), \prod_i p(y_i)\Big\} = -H(p(\vec{x})) - \int d\vec{x} \; p(\vec{x}) \log\Big(\prod_i p(y_i) \cdot |\det(M)|\Big) \ge 0 \tag{11}$$

Since the input entropy $H(p(\vec{x}))$ is independent of the input-output
transformation, the minimization of $K\{p(\vec{y}), \prod_i p(y_i)\}$ is
equivalent to maximization of
$\int d\vec{x} \; p(\vec{x}) \log(\prod_i p(y_i) \cdot |\det(M)|)$, i.e. to the
Maximum Likelihood Expectation (MLE) of $\log(\prod_i p(y_i) \cdot |\det(M)|)$.
In general, the analytical expressions for the marginal probabilities $p(y_i)$
are not known, and their estimates $\hat{p}(y_i)$ have to be obtained from the
data for every change of the matrix M.


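Similarly, in the information maximization problem the output joint entropy
$H(\vec{w})$ is equal to:

$$H(\vec{w}) = -\int d\vec{y} \; p(\vec{y}) \log\left(\frac{p(\vec{y})}{\prod_i f_i'(y_i)}\right) = -K\Big\{p(\vec{y}), \prod_i f_i'(y_i)\Big\} = H(p(\vec{x})) + \int d\vec{x} \; p(\vec{x}) \log\Big(\prod_i f_i'(y_i) \cdot |\det(M)|\Big) \tag{12}$$

Maximizing it is therefore equivalent to the MLE of
$\log(\prod_i f_i'(y_i) \cdot |\det(M)|)$.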


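Hence, the ICA with the Kullback-Leibler information measure and the maximum
information transfer as defined in this paper are posed as:

$$\begin{aligned}
\text{ICA} &\Rightarrow \min K\Big\{p(\vec{y}), \prod_i p(y_i)\Big\} \\
\text{INFOMAX} &\Rightarrow \min K\Big\{p(\vec{y}), \prod_i f_i'(y_i)\Big\}
\end{aligned} \tag{13}$$

or, equivalently, to the following MLE over the input probability $p(\vec{x})$:

$$\begin{aligned}
\text{ICA} &\Rightarrow \mathrm{MLE}\Big\{\log\Big(\prod_i \hat{p}(y_i) \cdot |\det(M)|\Big)\Big\} \\
\text{INFOMAX} &\Rightarrow \mathrm{MLE}\Big\{\log\Big(\prod_i f_i'(y_i) \cdot |\det(M)|\Big)\Big\}
\end{aligned} \tag{14}$$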

An MLE formulation of the InfoMax method is also discussed in [21]-[23]. The
initial parameterization of the derivatives $f_i'(y_i)$ in (14) has a possible
interpretation as the prior on the estimation of the actual marginal densities
$p(y_i)$. As mentioned earlier, both methods require parameterization of
$\hat{p}(y_i)$ and $f_i'(y_i)$. Hence, the problem statements in (13) and (14)
can be used to derive conditions for the equivalence of solutions of ICA and
InfoMax.


Lemma:

For a given input distribution $p(\vec{x})$, the ICA and InfoMax problems
achieve the same degree of statistical independence if the derivatives
$f_i'(y_i)$ can be parameterized in the form of the marginal distribution
estimates $\hat{p}(y_i)$.


The proof is straightforward since it requires that the parameterizations of
$\hat{p}(y_i)$ and $f_i'(y_i)$ are identical. This can be illustrated with an
example.


Example: 

The marginal probabilities $p(y_i)$ have to be estimated from the data. A
typical way of doing that is to estimate elements of a probability density
function expansion up to the desired order. Let us use the first element of the
Edgeworth expansion [19], i.e. let $\hat{p}(y_i)$ have the form of a Gaussian
whose mean and standard deviation $\sigma_i$ are equal to those of the actual
marginal distribution $p(y_i)$. Without loss of generality let us assume that
the input distribution $p(\vec{x})$ is zero-mean. In addition, let us
parameterize the derivatives $f_i'(y_i)$ as zero-mean Gaussian distributions
whose standard deviations $r_i$ are optimization parameters. Hence, it is easy
to see that the MLE problems in (14) become

$$\begin{aligned}
\text{ICA} &\Rightarrow \mathrm{MLE}\Big\{-\sum_i [\log(\sigma_i)] + \log(|\det(M)|)\Big\}, \qquad \sigma_i^2 = \langle y_i^2 \rangle \\
\text{INFOMAX} &\Rightarrow \mathrm{MLE}\Big\{-\sum_i \Big[\log(r_i) + \frac{\sigma_i^2}{2 r_i^2}\Big] + \log(|\det(M)|)\Big\}
\end{aligned} \tag{15}$$


The resulting ICA problem is nothing more than the covariance matrix
diagonalization [19] where the optimization is performed over the elements of
the matrix M. In the case of InfoMax, the unknown parameters are not only the
elements of M but also the Gaussian parameters $r_i$. It is easy to see that
the optimal value of $r_i$ for every fixed matrix M is the actual standard
deviation $\sigma_i$ and, therefore, that the solution of the InfoMax problem
will also result in the covariance matrix diagonalization.
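The claim that the optimal Gaussian width equals the actual standard deviation can be checked with a short numerical sketch. It assumes the per-component InfoMax term takes the form of the expected log-likelihood of a zero-mean Gaussian model with width r under data of variance sigma squared, up to an additive constant; the value of sigma and the search grid are hypothetical.

```python
import numpy as np

def infomax_term(r, sigma):
    """Expected Gaussian log-likelihood per component, up to a constant:
    -[log(r) + sigma^2 / (2 r^2)]."""
    return -(np.log(r) + sigma**2 / (2.0 * r**2))

sigma = 1.7                               # hypothetical actual standard deviation
r_grid = np.linspace(0.5, 4.0, 3501)      # candidate widths, step 0.001
r_best = r_grid[np.argmax(infomax_term(r_grid, sigma))]
```

Setting the derivative with respect to r to zero gives $-1/r + \sigma^2/r^3 = 0$, i.e. $r = \sigma$, which is what the grid search finds.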


In practice, it is required that the solutions of both methods are unique modulo
transformations that preserve statistical independence, such as the component
order permutation and diagonal scaling. The uniqueness is achieved if the
number of Gaussian components of $p(\vec{x})$ does not exceed one. In the case
of multiple Gaussian distributions, it is well known that there is an infinite
number of matrix transformations that diagonalize the covariance matrix. Hence,
the ICA and InfoMax algorithms for blind source separation will have unique
solutions only if the original signal did not have more than one Gaussian
component. In addition, there can be problems concerning the scaling of the
elements of the matrix M. It is the experience of the authors that imposing the
condition

$$\det(M) = 1 \tag{16}$$

makes the optimization numerically stable and avoids possible scaling problems.
Different parameterizations of M such that the condition in (16) holds can be
found in [19].


3.  ALTERNATIVE  REDUNDANCY  MEASURES  AND  NONLINEAR  ICA 

The previous section has demonstrated that ICA and InfoMax are identical when
the redundancy measure in ICA is the Kullback-Leibler information distance and
when sufficient freedom is given to the marginal output probability modeling and
estimation. Nevertheless, there are other measures that are easy to implement,
especially in the case of a linear mixing with a matrix M. The following part of
the paper briefly reviews ICA based on the properties of the cumulant expansion
of the joint probability density function $p(\vec{y})$. The detailed derivation
and analysis of the cumulant based ICA can be found in [17] and [19].

The cumulant based criterion for ICA is derived by comparison of the cumulant
expansion of the joint probability density $p(\vec{y})$ and of the product of
the marginal output probabilities $p(y_i)$. The complete factorization is
achieved if both expansions are the same, i.e. if the non-diagonal coefficients
in the higher order cumulants of $p(\vec{y})$ take the desired values (usually
zero) imposed by the statistical independence of $p(y_i)$. Since the cumulant
expansion of an arbitrary distribution has an infinite number of elements, for
practical purposes only cumulants up to order four are considered. Hence, the
resulting cumulant based ICA criterion has the following form:


$$J(M) = \sum_{i=1}^{4} \sum_{\mathrm{nondiag}} \Big[ C^{(i)}_{\mathrm{nondiag}} - C^{(i)}_{\mathrm{nondiag\text{-}desired}} \Big]^2 \tag{17}$$

where i defines the cumulant order, and where $C^{(i)}_{\mathrm{nondiag}}$ and
$C^{(i)}_{\mathrm{nondiag\text{-}desired}}$ are the non-diagonal cumulant
coefficients and their desired values for a given cumulant order i of the joint
probability density function $p(\vec{y})$. In general, the desired coefficients
are equal to zero. For every change of the matrix M, the non-diagonal
coefficients are estimated and the cost function J(M) is further minimized.


The cumulant based ICA criterion can be further simplified by using the
properties of the cumulant expansion when M is a rotation matrix R [17]. In the
case of the rotation matrix, the minimization of the criterion in (17) becomes
equivalent to maximization of the sum of squared diagonal elements:

$$\min \{J(R)\} = \max \sum_{i=1}^{4} \sum_{\mathrm{diag}} \Big[ C^{(i)}_{\mathrm{diag}} \Big]^2 \tag{18}$$

Consequently, the original cumulant cost function is significantly reduced when
the linear transformation M is restricted to the set of rotation matrices.
Although in general M is not a rotation matrix, we can still try to take
advantage of the property described in (18).

Statistical independence implies a diagonal structure of the output covariance
matrix. Hence, let N be an invertible matrix which diagonalizes the covariance
matrix $Q_{\vec{x}}$ of the input variable $\vec{x}$:

$$N \cdot Q_{\vec{x}} \cdot N^T = D_1 \tag{19}$$

where $D_1$ is a nonsingular diagonal matrix according to our assumptions about
the original signals and the mixing process. Then, all linear input-output
transformations M which result in statistical independence of the output
components at higher order while preserving the diagonal structure of the
covariance matrix of $\vec{y} = M \cdot \vec{x}$ can be parameterized as
follows [17]:

$$M = P \cdot D \cdot R \cdot D_1^{-0.5} \cdot N \tag{20}$$

where P is a permutation matrix, D is an invertible diagonal scaling matrix, and
R is a rotation.

Since statistical independence is invariant to diagonal scaling and permutation,
the only free variable after the scaling with $D_1^{-0.5} \cdot N$ is the
rotation matrix R.
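The role of the diagonalizing matrix N and of the subsequent scaling can be illustrated with a minimal whitening sketch. The data, the mixing matrix, and the use of an eigendecomposition to obtain N are assumptions of the example; any invertible N that diagonalizes the input covariance would serve.

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.uniform(-1.0, 1.0, size=(2, 5000))   # hypothetical independent sources
A = np.array([[2.0, 1.0],
              [1.0, 1.5]])                   # hypothetical mixing matrix
x = A @ s
x = x - x.mean(axis=1, keepdims=True)        # enforce zero mean

Q = x @ x.T / x.shape[1]                     # sample covariance of the input
d, E = np.linalg.eigh(Q)                     # Q = E diag(d) E^T
N = E.T                                      # then N Q N^T = diag(d) = D_1
V = np.diag(d ** -0.5) @ N                   # scaling D_1^{-0.5} followed by N

v = V @ x
cov_v = v @ v.T / v.shape[1]                 # identity up to rounding error
```

After this step any rotation R leaves the covariance equal to the identity, so the higher-order criterion only has to determine R.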

A suitable parameterization of rotation matrices is defined in [19]:

$$R = (I + A)^{-1} (I - A); \qquad A^T = -A \tag{21}$$

where A is a skew-symmetric matrix whose number of independent parameters
is equal to 0.5n(n - 1). This parameterization covers all rotation matrices R
with the property that (I + R) is nonsingular, i.e. that no eigenvalue of R is
equal to -1. The latter condition represents no restriction in our case since
there always exists a diagonal unitary matrix which can make the eigenvalues
different from -1.
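The parameterization of a rotation by a skew-symmetric matrix (the Cayley transform) can be sketched as follows. The function name and the parameter values are hypothetical; the sketch only demonstrates that 0.5n(n - 1) free parameters produce an orthogonal matrix with determinant one.

```python
import numpy as np

def rotation_from_skew(params, n):
    """Build R = (I + A)^{-1} (I - A) from the n(n-1)/2 free parameters of A."""
    A = np.zeros((n, n))
    A[np.triu_indices(n, k=1)] = params      # fill the strict upper triangle
    A = A - A.T                              # make A skew-symmetric: A^T = -A
    I = np.eye(n)
    # Solve (I + A) R = (I - A) rather than forming the inverse explicitly.
    return np.linalg.solve(I + A, I - A)

R = rotation_from_skew(np.array([0.3, -1.2, 0.7]), 3)
```

Because the parameters of A are unconstrained, a gradient based method can optimize them directly without having to enforce orthogonality of R.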

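Now, the cost function in (17) can be posed as an optimization problem w.r.t.
the matrix R, i.e. A:

$$\max_{A} \sum_{i=1}^{4} \sum_{\mathrm{diag}} \Big[ C^{(i)}_{\mathrm{diag}} \Big]^2, \qquad R = (I + A)^{-1}(I - A), \quad R^T = R^{-1}, \quad \vec{y} = R \cdot D_1^{-0.5} \cdot N \cdot \vec{x} \tag{22}$$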


The 0.5n(n - 1) free parameters of the matrix A can be determined by any
gradient based method such as backpropagation. Furthermore, the diagonal
elements of the cumulant tensors are the cumulants of the individual elements of
$\vec{y}$ which, in turn, are polynomial functions of the cross-cumulant
elements of $\vec{x}$. Consequently, the optimization can be significantly
simplified by precalculating the cumulants of $\vec{x}$ up to the order i. It
is the experience of the authors that the cumulant based ICA criterion in (22)
is numerically superior to the Kullback-Leibler distance based ICA. Several
applications of the cumulant based blind source separation method presented in
(22), including the "Cocktail Party" example and the mixture of uniformly
distributed signals, can be found in [19].

As the last point of this section, the authors would like to mention that the
ICA problem can also be formulated in the case where the input-output map is
not a matrix but an invertible nonlinear function F. A parameterization of such
functions with the so-called "triangular volume preserving network" is presented
in [12] and [19]. Reference [19] presents several applications of Nonlinear
ICA.


4.  CONCLUSIONS 

This  paper  has  provided  a  detailed  and  rigorous  analysis  of  the  two  commonly  used 
methods  for  blind  source  separation:  Linear  Independent  Component  Analysis 
(ICA)  and  Information  Maximization  (InfoMax).  The  paper  showed  analytically 
that  ICA  based  on  the  Kullback-Leibler  information  as  a  mutual  information 
measure  and  InfoMax  lead  to  the  same  solution  if  the  parameterization  of  the 
output  nonlinear  functions  in  the  latter  method  is  sufficiently  rich.  Furthermore, 
this work has discussed the alternative, cumulant based blind source separation
methods and Nonlinear ICA. The practical issues of applying ICA and InfoMax
were also discussed.

Abstract

In this paper we present a simple, efficient, local unsupervised learning
algorithm for on-line adaptive multichannel blind deconvolution and
separation of i.i.d. sources. Under mild conditions, there exists a stable
inverse system so that the source signals can be exactly recovered from their
convolutive mixtures. Based on the existence of the inverse filter, we
construct a two-stage neural network which consists of blind equalization and
source separation. In the blind equalization stage, we employ anti-Hebbian
learning in the temporal domain for decorrelation. For blind separation, we can
apply any existing algorithm. Extensive computer simulations confirm the
validity and high performance of our proposed learning algorithm.


1  INTRODUCTION 

Blind signal separation from convolutive mixtures of unknown source signals is a
challenging and fundamental problem encountered in many applications such
as the cocktail party problem, wideband array signal processing, image
processing, digital communication, and some biomedical applications. In real
world applications, the observed signals obtained from sensors are usually
convolutive mixtures because the source signals propagate through a dynamic
medium and are subject to parasitic effects like multiple echoes and
reverberation. In this paper, we present a new approach to multichannel blind
deconvolution and separation of i.i.d. source signals. In multichannel
deconvolution and separation, an m dimensional vector of received signals
x(k) is assumed to be generated from an n dimensional vector of independent
source signals s(k) using multivariate linear time invariant filters, i.e.,

$$x(k) = \sum_{i=0}^{M} H_i s(k-i) + n(k) = [H(z)] s(k) + n(k), \qquad (1)$$

where $H(z) = \sum_{i=0}^{M} H_i z^{-i}$ is the channel transfer function ($H(z)$ is an
$m \times n$ polynomial matrix and $z^{-1}$ is the delay operator such that
$z^{-i} s(k) = s(k-i)$) and $n(k)$ is additive white Gaussian noise. We assume that
the number of sensors, $m$, is strictly greater than the number of sources, $n$. The
problem of multichannel deconvolution and separation is to recover the source
signals $s(k)$ from the received signals $x(k)$, up to a scale factor, a permutation
ambiguity, and an arbitrary delay, i.e., $\hat{s}(k) = P \Lambda D(z) s(k)$, where $P$
is a permutation matrix, $\Lambda$ is a nonsingular diagonal matrix, and
$D(z) = \mathrm{diag}\{z^{-d_1}, \ldots, z^{-d_n}\}$.
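
As an illustration of the mixing model (1), the following minimal Python sketch generates noise-free convolutive mixtures from a set of tap matrices. The function name `convolutive_mix`, the dictionary layout of the taps, and the convention $s(k) = 0$ for $k < 0$ are assumptions made for this example; they are not taken from the paper.

```python
def convolutive_mix(taps, sources):
    """Noise-free version of x(k) = sum_i H_i s(k - i).

    taps    : dict mapping a lag i to H_i, an m x n list of lists
    sources : list of n-dimensional source vectors s(0), s(1), ...
    Returns the list of m-dimensional mixtures x(0), x(1), ...
    Samples s(k) with k < 0 are treated as zero (an assumption).
    """
    m = len(next(iter(taps.values())))
    mixtures = []
    for k in range(len(sources)):
        x = [0.0] * m
        for lag, H in taps.items():
            if k - lag < 0:
                continue  # source sample before the start of the record
            s = sources[k - lag]
            for row in range(m):
                x[row] += sum(h * v for h, v in zip(H[row], s))
        mixtures.append(x)
    return mixtures
```

With three sensors and two sources, for instance, `taps` could hold a 3 x 2 matrix at lag 0 and another at lag 1; lags that are absent from the dictionary contribute nothing, which mirrors a sparse set of delays.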

Most existing methods were devoted to recovering spatially independent source
signals from their convolutive mixtures. The output decorrelation approach was
proposed by Compernolle and Gerven [11] and further investigated in [25, 5, 16]. The
methods developed for the separation of instantaneous mixtures have been extended
to the multichannel deconvolution problem (see [15, 24, 10, 2]). The Bussgang approach
in the frequency domain has been employed in [17, 18, 19]. Most time-domain
methods fail to recover the source signals when the channels are nonminimum
phase systems. To overcome this difficulty, frequency-domain approaches have been
suggested [17, 18, 19]. However, the frequency-domain approach is block-adaptive
because it requires computing the FFT. Thus, a time-domain approach is preferable if
the nonminimum phase channels can be handled properly. In most of the aforementioned
methods, the number of sensors is assumed to be equal to the number of sources, and
the methods are usually restricted to the case of only two source signals. As will be
shown in this paper, a stable exact inverse system for the convolutive mixture model (1)
does not exist if there are equal numbers of sources and sensors, except for trivial cases.

In this paper, we investigate the separation of i.i.d. source signals (more precisely,
spatially independent and temporally uncorrelated source signals) and develop
an on-line adaptive algorithm. The signal separation task is split into two stages:
spatio-temporal decorrelation and blind source separation of instantaneous mixtures.
As will be shown in this paper, blind equalization (based on second-order statistics
only) is able to deconvolve Multi Input Multi Output (MIMO) FIR channels up
to linear mixtures of the source signals. Source separation is then employed to
separate the instantaneous mixtures.

2  THEORETICAL  FUNDAMENTALS,  BASIC  ASSUMPTIONS

Throughout  this  paper,  the  following  model  assumptions  are  made: 

A1) The source $s(k)$ is zero-mean with non-zero variance, temporally uncorrelated,
and spatially independent, i.e.,

$$E\{s_i(k)\} = 0, \quad \forall k, \qquad (2)$$
$$E\{s_i^2(k)\} \neq 0, \qquad (3)$$
$$E\{s_i(k) s_i(k-\tau)\} = 0, \quad \forall \tau \neq 0, \qquad (4)$$
$$E\{s_i(k) s_j(k-\tau)\} = 0, \quad \forall \tau, \ i \neq j, \qquad (5)$$

where $E$ denotes the expectation operator.

A2) The number of sensors, $m$, is strictly greater than the number of sources, $n$,
i.e., $m > n$.

In blind equalization of Single Input Multi Output (SIMO) FIR channels, it has
been shown that under mild conditions (for example, the channels do not have common
zeros), temporal decorrelation of the composite output (the sum of the outputs of the
equalizer bank) can equalize the channels (see [20, 6, 7] for details). This is
extended here to blind equalization of MIMO FIR channels. In this section, we present
fundamental theoretical results addressing two questions: (1) under what conditions
does a stable inverse system (equalizer) for MIMO FIR channels exist? (2) if an
equalizer for MIMO FIR channels exists, how can we equalize the channels? Let $W(z)$
be the equalizer of the MIMO FIR channel $H(z)$ (an $m \times n$ polynomial matrix).
For a given channel, a zero-forcing condition for blind equalization of MIMO channels
is given by

$$W(z) H(z) = P \Lambda D(z). \qquad (6)$$

The transfer function of the channel, $H(z)$, is an $m \times n$ polynomial matrix having
$\binom{m}{n}$ distinct $n \times n$ submatrices. Let $\Delta_i(z)$, $i = 1, 2, \ldots, \binom{m}{n}$,
denote the determinants of these submatrices. Let GCD denote the greatest common
divisor. In terms of these quantities, the condition for the existence of an FIR
equalizer $W(z)$ was given by Massey and Sain [22].

Theorem 1 An FIR inverse system exists if and only if

$$\mathrm{GCD}[\Delta_1(z), \Delta_2(z), \ldots, \Delta_{\binom{m}{n}}(z)] = z^{-d} \qquad (7)$$

for some $d \geq 0$.

In particular, such an FIR equalizer does not exist if the $\Delta_i(z)$ have common
zeros (except common zeros at the origin). Note that this condition cannot be satisfied
when the number of sensors equals the number of sources, except for trivial cases. Let
$G(z)$ be the global system described as $G(z) = W(z) H(z)$. Then $G(z)$ is also
FIR, of the form

$$G(z) = \sum_{i=0}^{P} G_i z^{-i}, \qquad (8)$$

where $P$ is the upper bound of the order of $G(z)$. We generalize the zero-forcing
condition (6) as

$$W(z) H(z) = \Gamma D(z), \qquad (9)$$

where $\Gamma$ is an $m \times m$ nonsingular matrix. This generalized zero-forcing condition
(9) has recently been investigated by R. Liu [21] for the case $D(z) = I z^{-d}$.

Theorem 2 Let the channel $H(z)$ satisfy the condition (7). Suppose that the
assumptions (A1) and (A2) are satisfied. Then the generalized zero-forcing condition
(9) is satisfied if $y(k)$ satisfies

$$E\{y(k) y^T(l)\} = \Gamma \delta_{kl}, \qquad (10)$$

where $\delta_{kl}$ is the Kronecker delta, equal to 1 for $k = l$ and 0 otherwise.

Sketch of proof: Both $H(z)$ and $W(z)$ are FIR, so $G(z) = W(z) H(z)$ is also
FIR. Then $y(k) = \sum_{i=0}^{P} G_i s(k-i)$. It is easy to see that if $y(k)$ satisfies (10),
then the generalized zero-forcing condition (9) is satisfied.

Thus we can equalize the MIMO channel $H(z)$, up to linear mixtures, by
spatio-temporal decorrelation of the output $y(k)$. Note that this result is consistent
with some results in [12, 13]. Suppose that $H(z)$ is full rank for all $z$. This implies
that $H(z)$ is a minimum phase system. Then the source signals $s(k)$ can be viewed as
the normalized innovation sequence of the observed sequence $x(k)$. One can recover the
innovation sequence from $x(k)$ up to an orthogonal matrix $Q$. This is emphasized
in [12, 13], where whitening is done by linear prediction. In contrast to [12, 13], our
approach is more robust in the sense that we do not need exact knowledge of the
order of the channels.

3  NEURAL  NETWORK  MODEL  AND  THE  LEARNING  ALGORITHMS

Let $W(z)$ be a generalized zero-forcing equalizer and $U \in \mathbb{R}^{n \times m}$ be a
demixing memoryless network. Provided that the channel $H(z)$ satisfies the condition (7),
$U W(z)$ is a stable inverse system of the FIR channel $H(z)$, and the global
system $G(z)$ should satisfy the relation

$$G(z) = U W(z) H(z) = P \Lambda D(z). \qquad (11)$$

Our approach to multichannel blind deconvolution is illustrated in Figure 1.


Figure 1: Neural network block diagram for multichannel blind deconvolution:
$W(z)$ represents a generalized zero-forcing recurrent equalizer and the matrix $U$
describes a demixing feedforward memoryless network.

For a generalized zero-forcing equalizer $W(z)$, we consider the linear feedback
network shown in Figure 2. This network is almost fully connected in the spatial and
temporal domains. Note that it does not have algebraic loops (connections between
$y_i(k)$ and $y_j(k)$, or self-connections from $y_i(k)$ to $y_i(k)$). The generalized
zero-forcing equalizer can be viewed as a whitening filter or spatio-temporal
decorrelation filter. The output $y_i(k)$ is described by (see Figure 2)

$$y_i(k) = x_i(k) + \sum_{j=1}^{m} \sum_{p=1}^{L} w_{ijp}(k) y_j(k-p), \quad \text{for } i = 1, \ldots, m, \qquad (12)$$

where the synaptic weight $w_{ijp}(k)$ is the connection strength between $y_i(k)$ and
$y_j(k-p)$. In matrix form,

$$y(k) = x(k) + \sum_{p=1}^{L} W_p(k) y(k-p), \qquad (13)$$

where $y(k) = [y_1(k), \ldots, y_m(k)]^T$, $x(k) = [x_1(k), \ldots, x_m(k)]^T$, and
$W_p(k) = [w_{ijp}(k)]$ is the synaptic weight matrix. From (13), it is easy to see that
the equalizer $W(z)$ can be expressed as

$$W(z) = \Big(I - \sum_{p=1}^{L} W_p z^{-p}\Big)^{-1}. \qquad (14)$$

For spatio-temporal decorrelation of the output signals $y(k)$, we apply a simple
learning algorithm which is a temporal variant of anti-Hebbian learning. The weight
matrices $W_p(k)$ for $p = 1, \ldots, L$ are updated by the following learning algorithm:

$$W_p(k+1) = W_p(k) - \eta(k) \{ y(k) y^T(k-p) \}, \qquad (15)$$

where $\eta(k) > 0$ is a learning rate. It is easy to see that the above algorithm (15)
converges when

$$E\{ y(k) y^T(k-p) \} = 0, \quad \text{for } p = 1, \ldots, L. \qquad (16)$$

Thus at steady state $y(k) = \Gamma D(z) s(k)$. Note that all stable equilibria of (16) are
desirable solutions, where the outputs $y(k)$ are uncorrelated in the spatio-temporal
domain. Note also that the learning algorithm (15) for the generalized zero-forcing
equalizer is very simple, local, and biologically plausible. After spatio-temporal
decorrelation by a generalized zero-forcing equalizer, the output $y(k)$ consists of
instantaneous mixtures of the source signals. At the second stage, a feedforward
memoryless source separation network is implemented, described as

$$z(k) = U(k) y(k). \qquad (17)$$
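
The first-stage recursion (13) and the anti-Hebbian update (15) can be sketched in a few lines of plain Python. This is a minimal illustration only: the function name `equalizer_step`, the list-of-lists matrix layout, and the way past outputs are passed in are assumptions of this example rather than details given in the paper.

```python
def equalizer_step(x, y_past, W, eta):
    """One iteration of the linear feedback equalizer.

    x      : current sensor vector x(k), length m
    y_past : [y(k-1), ..., y(k-L)], each of length m
    W      : [W_1, ..., W_L], each an m x m list of lists
    eta    : learning rate eta(k) > 0
    Returns (y(k), updated weight matrices).
    """
    m = len(x)
    # Equation (13): y(k) = x(k) + sum_p W_p(k) y(k-p)
    y = list(x)
    for Wp, yp in zip(W, y_past):
        for i in range(m):
            y[i] += sum(Wp[i][j] * yp[j] for j in range(m))
    # Equation (15): W_p(k+1) = W_p(k) - eta * y(k) y(k-p)^T
    W_next = [
        [[Wp[i][j] - eta * y[i] * yp[j] for j in range(m)] for i in range(m)]
        for Wp, yp in zip(W, y_past)
    ]
    return y, W_next
```

Because only delayed outputs $y(k-p)$ with $p \geq 1$ enter the sum, the output can be computed directly without solving for $y(k)$, which reflects the absence of algebraic loops noted in the text.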

Figure 2: The proposed linear temporal feedback network for the generalized
zero-forcing equalizer.

For the separation of instantaneous mixtures, any existing method [14, 9, 3, 1, 4, 8]
can be employed. In this paper, we employ the following source separation
algorithm [1]:

$$U(k+1) = U(k) + \mu(k) \{ I - f(z(k)) z^T(k) \} U(k), \qquad (18)$$

where $f(z(k)) = [f(z_1(k)), \ldots, f(z_n(k))]^T$ is a properly chosen nonlinear function.
The choice of this function depends on the statistics of the source signals. For example,
it is known that $f(z_i(k)) = \beta z_i(k) + z_i^3(k)$ is suitable for sub-Gaussian source
signals and $f(z_i(k)) = \beta z_i(k) + \tanh(\alpha z_i(k))$ is suitable for super-Gaussian
source signals, for some $\beta > 0$.
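
The second-stage update (18) can be sketched as follows. This is a minimal illustration in plain Python; the names `separation_step`, `f_sub`, and `f_super`, and the explicit `beta` and `alpha` arguments, are assumptions of this example, and the step size is simply passed in as `mu`.

```python
import math

def f_sub(z, beta):
    # Nonlinearity quoted in the text for sub-Gaussian sources.
    return beta * z + z ** 3

def f_super(z, beta, alpha):
    # Nonlinearity quoted in the text for super-Gaussian sources.
    return beta * z + math.tanh(alpha * z)

def separation_step(U, y, mu, f):
    """One update U(k+1) = U(k) + mu * (I - f(z) z^T) U(k), with z = U(k) y(k).

    U  : n x m demixing matrix (list of lists)
    y  : equalizer output y(k), length m
    mu : step size
    f  : scalar nonlinearity applied to each component of z
    Returns (z(k), updated U).
    """
    n, m = len(U), len(U[0])
    z = [sum(U[i][j] * y[j] for j in range(m)) for i in range(n)]
    fz = [f(v) for v in z]
    # G = I - f(z) z^T, an n x n matrix
    G = [[(1.0 if i == r else 0.0) - fz[i] * z[r] for r in range(n)] for i in range(n)]
    U_next = [
        [U[i][j] + mu * sum(G[i][r] * U[r][j] for r in range(n)) for j in range(m)]
        for i in range(n)
    ]
    return z, U_next
```

A call such as `separation_step(U, y, 0.01, lambda v: f_sub(v, 1.0))` would apply the sub-Gaussian choice to every output; the values 0.01 and 1.0 are placeholders, not settings reported in the paper.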

4  COMPUTER  SIMULATIONS 

One exemplary simulation result is presented here. In this simulation, three i.i.d.
sources and five sensors were used. The three i.i.d. sources consist of random variables
that are uniformly distributed over the binary set $\{-1, +1\}$. The five received signals
$x(k)$ were generated by

$$x(k) = H_0 s(k) + H_1 s(k-1) + H_2 s(k-2) + H_5 s(k-5) + H_{10} s(k-10). \qquad (19)$$

The weighting coefficients of the mixing/convolutive model, $H_i$, $i = 0, 1, 2, 5, 10$,
were generated randomly over the interval $[-1, 1]$. They were assumed to be
completely unknown.


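The entries of the $5 \times 3$ coefficient matrices survive here only as a flat sequence
in printed order, so the row and column layout is not recoverable. The first 15 entries
precede the label $H_1$, the next 30 lie between the labels $H_1$ and $H_5$, and the
last 30 follow the label $H_5$:

-0.8244, -0.5281, 0.2690, -0.7599, -0.2378, 0.8745, 0.2790, -0.4718, 0.9545, 0.7246,
-0.2266, -0.2742, -0.8027, 0.2088, 0.6510

0.0496, -0.3765, 0.3459, 0.3813, -0.1638, -0.4125, 0.9096, 0.1851, 0.7272, -0.3169,
-0.7274, -0.2183, -0.0720, -0.2393, 0.7922, 0.3430, -0.0721, 0.3401, 0.3381, -0.5663,
0.1081, 0.0393, 0.8915, -0.7729, 0.9698, -0.2780, 0.1904, 0.3706, -0.0686, -0.0818

-0.0709, 0.1803, 0.3893, -0.9449, -0.9798, -0.7116, -0.1861, -0.5072, -0.9113, -0.8525,
0.4407, -0.3565, 0.8387, -0.7378, 0.3675, -0.1154, -0.0173, 0.6804, 0.4578, 0.4206,
-0.6061, -0.3082, -0.3343, -0.1155, 0.1367, -0.2402, 0.2099, -0.9465, 0.9707, 0.1975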

The eye pattern is shown in Figure 3. Figure 4 shows the original source signals
$s(k)$, the convolutive mixtures $x(k)$, and the recovered signals $z(k)$. Only 51 samples
of each are plotted. It can be observed that $z_1(k) = -s_2(k)$, $z_2(k) = s_1(k)$, and
$z_3(k) = s_3(k)$. From Figure 4, we can see that the recovery of i.i.d. source signals
from their convolutive mixtures is perfect even in the case of nonminimum phase
channels.


Figure 3: The eye pattern of the recovered signals: (a) $z_1(k)$; (b) $z_2(k)$; (c) $z_3(k)$.


5  CONCLUSION 

Figure 4: Original source signals $s_i(k)$, convolutive mixtures $x_i(k)$, and recovered
signals $z_i(k)$.

We have presented a new adaptive scheme for blind separation of source signals from
their convolutive mixtures. Fundamental theoretical results and an implementation have
been presented. Under mild conditions, we have shown that i.i.d. source signals
can be recovered by the generalized zero-forcing equalizer followed by instantaneous
blind separation. A linear feedback network with associated anti-Hebbian learning has
been constructed to perform spatio-temporal decorrelation. Computer simulation
experiments demonstrated the validity and high performance of our proposed approach.

Abstract- Blind deconvolution and separation of linearly mixed and
convolved sources is an important and challenging task for numerous
applications. While several recently developed algorithms
have shown promise in these tasks, these techniques may fail to
separate signal mixtures containing both sub- and super-Gaussian-distributed
sources. In this paper, we present a simple and efficient
extension of a family of algorithms that enables the separation and
deconvolution of mixtures of arbitrary non-Gaussian sources. Our
technique monitors the statistics of each of the outputs of the separator
using a rigorously derived sufficient criterion for stability and
then selects the appropriate nonlinearity for each channel such that
local convergence conditions of the algorithm are satisfied. Extensive
simulations show the validity and efficiency of our method to
blindly extract mixtures of arbitrarily distributed source signals.

I.  INTRODUCTION 

Blind signal separation is useful for numerous problems in biomedical
signal analysis, acoustics, communications, and signal and image processing.
In blind source separation of instantaneous signal mixtures, a set of measured
signals $\{x_i(k)\}$, $1 \leq i \leq n$, is assumed to be generated from a set of unknown
stochastic independent sources $\{s_i(k)\}$, $1 \leq i \leq m$, $m \leq n$, as

$$x(k) = H s(k), \qquad (1)$$

where $x(k) = [x_1(k) \cdots x_n(k)]^T$, $s(k) = [s_1(k) \cdots s_m(k)]^T$, and $H$ is an
$(n \times m)$-dimensional matrix of unknown mixing coefficients $\{h_{ij}\}$. The measured
sensor signals are processed by a linear single-layer feed-forward network as

$$y(k) = W(k) x(k), \qquad (2)$$

where $y(k) = [y_1(k) \cdots y_m(k)]^T$ and $W(k)$ is an $(m \times n)$-dimensional synaptic
weight matrix. Ideally, $W(k)$ is adjusted iteratively such that

$$\lim_{k \to \infty} W(k) H = P D, \qquad (3)$$

where $P$ is an $(m \times m)$-dimensional permutation matrix with a single unity
entry in any of its rows or columns and $D$ is a diagonal nonsingular matrix.

Recently, several simple, efficient, and robust iterative algorithms for
adjusting $W(k)$ have been proposed for the blind signal separation task [1]-[12].
Such methods use higher-order statistical information about the source signals
to iteratively adjust the coefficient matrix $W(k)$. In this paper, we consider
one class of on-line adaptive algorithms given by [1]

$$W(k+1) = W(k) + \eta(k) [I - f(y(k)) y^T(k)] W(k), \qquad (4)$$

where $f(y(k)) = [f_1(y_1(k)) \cdots f_m(y_m(k))]^T$. The optimal forms of the
nonlinear functions $\{f_i(y)\}$ can be shown to be dependent on the statistics of
the source signals [2, 7, 8]. For example, if the signal mixture consists of
sub-Gaussian sources with negative kurtoses, the choices $f_i(y) = f_N(y) =
|y|^p \mathrm{sgn}(y)$ for $p = \{2, 3, \ldots\}$ provide adequate separation capabilities. For
mixtures of super-Gaussian sources with positive kurtoses, the choice $f_i(y) =
f_P(y) = \tanh(\alpha y)$ with $\alpha > 0$ can be used [3, 11, 12].

A task related to blind signal separation is that of multichannel signal
deconvolution, in which $x(k)$ is assumed to be produced from $s(k)$ as

$$x(k) = \sum_{p=-\infty}^{\infty} H_p s(k-p), \qquad (5)$$

where $H_p$ is an $(n \times m)$-dimensional matrix of mixing coefficients at lag $p$. The
goal is to calculate a vector $y(k)$ of possibly scaled and/or delayed estimates
of the source signals in $s(k)$ from $x(k)$ using a causal linear filter given by

$$y(k) = \sum_{p=0}^{L} W_p(k) x(k-p), \qquad (6)$$

where the $(m \times n)$-dimensional matrices $\{W_p(k)\}$, $0 \leq p \leq L$, contain the
coefficients of the multichannel filter. One algorithm that can be used in this
task is described in [5, 6] and is given by

$$W_p(k+1) = W_p(k) + \eta(k) [W_p(k) - f(y(k-L)) u^T(k-p)], \qquad (7)$$

where the $n$-dimensional vector $u(k)$ is computed as

$$u(k) = \sum_{q=0}^{L} W_{L-q}^T(k) y(k-q). \qquad (8)$$

This algorithm reduces to that in (4) for $L = 0$.

Although simulations have indicated that these algorithms are successful
at separating and deconvolving linearly mixed signals, they require knowledge
about the statistics of the source signals to function properly. In particular, it
must be known a priori whether the source signals are sub-Gaussian or
super-Gaussian so that the nonlinearities $f_i(y)$ can be properly chosen. Even worse,
if the measured signals $x_i(k)$ contain mixtures of both sub-Gaussian (e.g.
digital data) and super-Gaussian (e.g. speech) sources, then these algorithms
may fail to separate these signals reliably.
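
The multichannel update in (7) and (8) can be sketched as below. This is a minimal plain-Python illustration under stated assumptions: the function name `deconvolution_step` and the buffer conventions for past values of $y$ and $u$ are choices made for this example. The nonlinearity `f` is passed in explicitly because, as discussed above, it has to match the source statistics.

```python
def deconvolution_step(W, y_hist, u_past, eta, f):
    """One coefficient update of the multichannel filter.

    W      : [W_0, ..., W_L], each an m x n list of lists
    y_hist : [y(k), y(k-1), ..., y(k-L)], each of length m
    u_past : [u(k-1), ..., u(k-L)], each of length n (empty if L = 0)
    eta    : step size eta(k)
    f      : scalar nonlinearity applied to each output
    Returns (updated W, [u(k), u(k-1), ..., u(k-L)]).
    """
    L = len(W) - 1
    m, n = len(W[0]), len(W[0][0])
    # Equation (8): u(k) = sum_q W_{L-q}^T(k) y(k-q)
    u = [
        sum(W[L - q][i][j] * y_hist[q][i] for q in range(L + 1) for i in range(m))
        for j in range(n)
    ]
    u_hist = [u] + list(u_past[:L])   # u_hist[p] holds u(k-p)
    fy = [f(v) for v in y_hist[L]]    # f(y(k-L))
    # Equation (7): W_p(k+1) = W_p(k) + eta [W_p(k) - f(y(k-L)) u^T(k-p)]
    W_next = [
        [[W[p][i][j] + eta * (W[p][i][j] - fy[i] * u_hist[p][j]) for j in range(n)]
         for i in range(m)]
        for p in range(L + 1)
    ]
    return W_next, u_hist
```

With a single tap ($L = 0$), $u(k) = W_0^T(k) y(k)$ and the update becomes $W_0 + \eta [I - f(y) y^T] W_0$, which is the instantaneous algorithm (4).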

In this paper, we propose modifications to the algorithms in (4) and (7)
that enable sources from arbitrary non-Gaussian distributions to be extracted
from measurements of the mixed signals. Our methods use simple sufficient
conditions for algorithm stability that are based on the necessary stability
conditions originally derived by Amari et al. [3]. Our computationally simple
algorithms employ time-varying nonlinearities in the coefficient updates that
are selected from a family of fixed nonlinearities at each iteration to best
satisfy our sufficient stability conditions. Simulations show the excellent and
robust convergence behavior of the proposed methods in separating mixtures
of sub- and super-Gaussian sources.

II.  CRITERIA  FOR  ALGORITHM  STABILITY

The modified algorithms for separation of sources with arbitrary distributions
are based on the stability analysis of (4) that is described in [3]. For
brevity and simplicity of discussion, we only consider (4) and outline the
necessary extensions of the analysis that are needed to develop our modified
algorithms, although we later apply the results to the multichannel deconvolution
method in (7).

The algorithm in (4) can be derived as an iterative stochastic minimization
procedure for the cost function

$$\phi(W(k)) = -\frac{1}{2} \log(\det(W(k) W^T(k))) - \sum_{i=1}^{m} E\{\log p_i(y_i(k))\}, \qquad (9)$$

where $E\{\cdot\}$ denotes statistical expectation and $-d \log p_i(y)/dy = f_i(y)$. If
$p_i(y)$ is the actual probability distribution of the source extracted at the $i$th
output, then $\phi(W(k))$ represents the negative of the maximum likelihood cost
function [2, 7]. The procedure in (4) represents the natural gradient method
for minimizing (9) iteratively using signal measurements. For details on the
general form of the natural gradient search method, the reader is referred to
[1].

In [3], the stability of (4) is analyzed by studying the expected value of
the Hessian of the cost function, denoted as
$E\{\partial^2 \phi(W(k)) / \partial w_{ij}(k) \partial w_{pq}(k)\}$,
in the vicinity of a source separation solution. Here, $w_{ij}(k)$ is the $(i,j)$th
element of $W(k)$, which in [3] is assumed to be a square matrix ($m = n$).
In what follows, we remove this restriction. In analogy with the results of
[3], it is simpler to consider the form of $d^2 \phi(W(k))$ in terms of the modified
coefficient differential

$$dX(k) = dW(k) W^T(k) (W(k) W^T(k))^{-1}, \qquad (10)$$

such that

$$dX(k) y(k) = dW(k) x(k). \qquad (11)$$

Note that the natural gradient method automatically performs its search in
the coefficient space spanned by $dX(k)$, so that the coefficient updates remain
in the original column space of $W(0)$ for all $k$. We can then represent the
differential $d^2 \phi(W(k))$ in terms of the elements of $dX(k)$ as

$$d^2 \phi(W(k)) = y^T(k) dX^T(k) F'(y(k)) dX(k) y(k) + f^T(y(k)) dX(k) dX(k) y(k), \qquad (12)$$

where $F'(y(k))$ is a diagonal matrix whose $(i,i)$th entry is $f_i'(y_i(k))$.

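As is shown in [3], the expectations of the terms on the RHS of (12) are

$$E\{y^T(k) dX^T(k) F'(y(k)) dX(k) y(k)\} = \sum_{i=1}^{m} \sum_{j=1, j \neq i}^{m} \sigma_i^2(k) \kappa_j(k) [dx_{ji}(k)]^2 + \sum_{i=1}^{m} \mu_i(k) [dx_{ii}(k)]^2, \qquad (13)$$

$$E\{f^T(y(k)) dX(k) dX(k) y(k)\} = \sum_{i=1}^{m} \sum_{j=1}^{m} \rho_i(k) dx_{ij}(k) dx_{ji}(k), \qquad (14)$$

where $\mu_i(k)$, $\sigma_i^2(k)$, $\kappa_i(k)$, and $\rho_i(k)$ are defined as

$$\mu_i(k) = E\{y_i^2(k) f_i'(y_i(k))\}, \quad \sigma_i^2(k) = E\{y_i^2(k)\}, \qquad (15)$$

$$\kappa_i(k) = E\{f_i'(y_i(k))\}, \quad \text{and} \quad \rho_i(k) = E\{y_i(k) f_i(y_i(k))\}, \qquad (16)$$

respectively, and where it has been assumed that $y_i(k)$ and $y_j(k)$ are
independent for $i \neq j$. Thus, the expected value of the Hessian is

$$E\{d^2 \phi(W(k))\} = \sum_{i=1}^{m} \sum_{j=1, j \neq i}^{m} \left( \sigma_i^2(k) \kappa_j(k) [dx_{ji}(k)]^2 + \rho_i(k) dx_{ij}(k) dx_{ji}(k) \right) + \sum_{i=1}^{m} (\mu_i(k) + \rho_i(k)) [dx_{ii}(k)]^2. \qquad (17)$$

For stability, $E\{d^2 \phi(W(k))\}$ must be positive for all possible values of
$dx_{ij}(k)$. By examining the RHS of (17), one can obtain the following necessary
and sufficient stability conditions on $f_i(y)$ for all $1 \leq i < j \leq m$:

$$\kappa_i(k) > 0, \qquad (18)$$
$$\mu_i(k) + \rho_i(k) > 0, \qquad (19)$$
$$\sigma_i^2(k) \kappa_i(k) \sigma_j^2(k) \kappa_j(k) > \frac{1}{4} [\rho_i(k) + \rho_j(k)]^2. \qquad (20)$$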

The conditions in (18) and (19) are satisfied in practice for any odd
nondecreasing function $f_i(y) = -f_i(-y)$. However, the condition in (20) is
difficult to calculate in practice, as it involves $m(m-1)/2$ different combinations
for $1 \leq i < j \leq m$. For this reason, we consider a sufficient stability criterion
of the form

$$\sigma_i^2(k) \kappa_i(k) \sigma_j^2(k) \kappa_j(k) > \gamma^2(k) \rho_i(k) \rho_j(k), \qquad (21)$$

where $\gamma(k)$, $\gamma(k) > 0$, satisfies for all $1 \leq i < j \leq m$ the inequality

$$\gamma^2(k) \rho_i(k) \rho_j(k) \geq \frac{1}{4} [\rho_i(k) + \rho_j(k)]^2. \qquad (22)$$

After some algebra, we find that the smallest value of $\gamma(k)$ satisfying (22) is

$$\gamma(k) = \frac{1}{2} \left( \sqrt{\frac{\rho_{\max}(k)}{\rho_{\min}(k)}} + \sqrt{\frac{\rho_{\min}(k)}{\rho_{\max}(k)}} \right), \qquad (23)$$

where $\rho_{\max}(k)$ and $\rho_{\min}(k)$ are the maximum and minimum values of $\rho_i(k)$
for $1 \leq i \leq m$, respectively. For this value of $\gamma(k)$, we can guarantee the
stability of the algorithm in (4) if, for all $1 \leq i \leq m$,

$$\sigma_i^2(k) \kappa_i(k) - \gamma(k) \rho_i(k) > 0. \qquad (24)$$

Note that all values of $\rho_i(k)$ converge to one as the coefficients converge to a
separating solution due to the normalizing condition $E\{f(y(k)) y^T(k)\} = I$,
such that $\gamma(k) \approx 1$ near convergence.
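
The quantity in (23) and the per-output test in (24) are cheap to evaluate once estimates of $\sigma_i^2$, $\kappa_i$, and $\rho_i$ are available. The sketch below is a minimal plain-Python illustration; the function names are hypothetical, and the inputs are assumed to be lists of positive $\rho_i$ values and matching $\sigma_i^2$, $\kappa_i$ values.

```python
import math

def smallest_gamma(rho):
    """Equation (23): smallest gamma satisfying (22) for every pair (i, j).

    rho : list of positive values rho_i(k)
    """
    r_max, r_min = max(rho), min(rho)
    return 0.5 * (math.sqrt(r_max / r_min) + math.sqrt(r_min / r_max))

def satisfies_sufficient_condition(sigma2, kappa, rho):
    """Equation (24) checked for every output i, using gamma from (23)."""
    g = smallest_gamma(rho)
    return all(s * k - g * r > 0.0 for s, k, r in zip(sigma2, kappa, rho))
```

When all $\rho_i$ are equal, `smallest_gamma` returns one, in line with the remark that $\gamma(k) \approx 1$ near convergence. For the pair $\rho = (1, 4)$ it returns 1.25, for which (22) holds with equality: $1.25^2 \cdot 4 = 6.25 = \frac{1}{4}(1 + 4)^2$.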


III.  THE  ALGORITHM  MODIFICATION 

We now describe the modified algorithms that are based on the stability
criteria of the last section. It is known that the algorithm in (4) can determine
a separating solution for $W(k)$ if a set of nonlinearities $\{f_i(y)\}$, $1 \leq i \leq m$,
can be properly chosen. For this reason, our modified algorithms employ
a time-varying vector nonlinearity $f_k(y(k)) = [f_{1k}(y_1(k)) \cdots f_{mk}(y_m(k))]^T$,
where $f_{ik}(y)$ is chosen to be one of two nonlinearities $f_N(y)$ and $f_P(y)$ that are
optimized for sub- and super-Gaussian source separation tasks, respectively.
To select $f_{ik}(y)$ at time $k$, we form the time-averaged estimates

$$\sigma_i^2(k) = (1 - \delta) \sigma_i^2(k-1) + \delta |y_i(k)|^2, \qquad (25)$$
$$\kappa_{ir}(k) = (1 - \delta) \kappa_{ir}(k-1) + \delta f_r'(y_i(k)), \qquad (26)$$
$$\rho_{ir}(k) = (1 - \delta) \rho_{ir}(k-1) + \delta y_i(k) f_r(y_i(k)), \qquad (27)$$

for $r = \{N, P\}$ and $1 \leq i \leq m$, where $\delta$ is a small positive parameter. Then,
$f_{ik}(y)$ at time $k$ is selected as

$$f_{ik}(y) = \begin{cases} f_N(y) & \text{if } \sigma_i^2(k) \kappa_{iN}(k) - \gamma(k) \rho_{iN}(k) > \sigma_i^2(k) \kappa_{iP}(k) - \gamma(k) \rho_{iP}(k), \\ f_P(y) & \text{if } \sigma_i^2(k) \kappa_{iN}(k) - \gamma(k) \rho_{iN}(k) \leq \sigma_i^2(k) \kappa_{iP}(k) - \gamma(k) \rho_{iP}(k), \end{cases} \qquad (28)$$

where $\gamma(k)$ is computed at infrequent intervals from past estimates of $\rho_i(k)$.
With these choices, the resulting vector $f_k(y(k))$ is used in place of $f(y(k))$
to adjust the coefficient matrix in (4). In simulations, it was found that the
value of $\gamma(k)$ did not vary significantly over time, and in fact, setting $\gamma(k)$
equal to one in (28) for all $k$ appears to provide convergence of the algorithms
to a separating solution.
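
The per-channel selection procedure in (25)-(28) can be sketched as follows. This is a minimal plain-Python illustration, not the authors' implementation: the exponent `P_EXP = 3` in $f_N$, the dictionary-based state, and all function names are assumptions of this example, and $\gamma(k)$ defaults to one as suggested in the text.

```python
import math

P_EXP = 3      # assumed exponent p in f_N(y) = |y|^p sgn(y)
ALPHA = 10.0   # slope in f_P(y) = tanh(ALPHA * y)

def f_N(y):
    return math.copysign(abs(y) ** P_EXP, y)

def df_N(y):
    return P_EXP * abs(y) ** (P_EXP - 1)

def f_P(y):
    return math.tanh(ALPHA * y)

def df_P(y):
    return ALPHA * (1.0 - math.tanh(ALPHA * y) ** 2)

NONLINEARITIES = {"N": (f_N, df_N), "P": (f_P, df_P)}

def new_channel_state():
    """Running estimates for one output channel, all started at zero."""
    return {"sigma2": 0.0,
            "kappa": {"N": 0.0, "P": 0.0},
            "rho": {"N": 0.0, "P": 0.0}}

def update_and_select(state, y, delta=0.01, gamma=1.0):
    """Update the estimates (25)-(27) with sample y and apply rule (28).

    Returns "N" or "P", the label of the nonlinearity to use for this channel.
    """
    state["sigma2"] = (1.0 - delta) * state["sigma2"] + delta * abs(y) ** 2
    margin = {}
    for r, (f, df) in NONLINEARITIES.items():
        state["kappa"][r] = (1.0 - delta) * state["kappa"][r] + delta * df(y)
        state["rho"][r] = (1.0 - delta) * state["rho"][r] + delta * y * f(y)
        # Left-hand side of the sufficient condition (24) for candidate r
        margin[r] = state["sigma2"] * state["kappa"][r] - gamma * state["rho"][r]
    return "N" if margin["N"] > margin["P"] else "P"
```

In this sketch, a channel fed with binary $\pm 1$ samples settles on `"N"`, while a channel whose samples are mostly small with occasional large excursions settles on `"P"`, which matches the intended roles of the two nonlinearities.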

It can be seen that as the coefficients of the system converge, the quantity
$\sigma_i^2(k) \kappa_{ir}(k) - \rho_{ir}(k)$ becomes a reliable estimate of the left-hand side of the
inequality in (24) for $f_{ik}(y) = f_r(y)$. Extensive simulations have shown that,
so long as a set of nonlinearity assignments exists such that the stability
conditions in (18)-(20) are satisfied for one ordering of the extracted sources
at the outputs, then (28) properly selects each $f_{ik}(y)$ over time to enable the
system to reliably extract all source signals regardless of their distribution.


IV.  SIMULATIONS 


We now show the capabilities of our modified source separation algorithms
via simulation. In our first example, we employ the signal separation
method in (4) to separate ten instantaneously mixed signals. In this case,
the three signal sets $\{s_1(k), s_2(k), s_3(k), s_4(k)\}$, $\{s_5(k), s_6(k), s_7(k)\}$, and
$\{s_8(k), s_9(k), s_{10}(k)\}$ are i.i.d. with Laplacian, uniform-$[-1, 1]$, and
binary-$\{\pm 1\}$ distributions, respectively, where the Laplacian p.d.f. is given by
$p_s(s) = 0.5 e^{-|s|}$. Since the first and latter two distributions are super- and
sub-Gaussian, respectively, the algorithms in [4, 8] cannot linearly separate these
sources from an arbitrary mixture of them. We generate $x(k)$ as in (1), where
the entries of $H$ are drawn from a uniform-$[0, 1]$ distribution. The values of
$h_{ij}$ to four decimal places are shown in Table 1. As is clear from the table,
$H$ exhibits no particular structure, and thus the extraction of the ten sources
from the measured signals $x(k)$ is a challenging task.


Table 1: The entries of H for the ten source separation example.

0.8424  0.0710  0.7041  0.4146  0.5751  0.1854  0.1480  0.8016  0.9349  0.6022
0.2641  0.0570  0.4257  0.3068  0.6326  0.4599  0.2081  0.2801  0.1214  0.2266
0.3297  0.6562  0.5772  0.6455  0.5300  0.4425  0.2611  0.6381  0.1404  0.2975
0.6648  0.4651  0.7443  0.9982  0.4357  0.1115  0.2721  0.0364  0.0398  0.8565
0.9273  0.5970  0.8824  0.0223  0.4652  0.2274  0.1238  0.0919  0.4975  0.8663
0.4364  0.9322  0.9441  0.9772  0.5468  0.9701  0.7741  0.6716  0.9554  0.9502
0.6019  0.7337  0.2775  0.1033  0.0358  0.1367  0.0425  0.7553  0.4872  0.9896
0.8918  0.7483  0.4733  0.1339  0.1748  0.6777  0.4375  0.7595  0.2505  0.4781
0.9564  0.3582  0.1009  0.3141  0.1137  0.7476  0.8457  0.3628  0.6368  0.3873
0.8823  0.6291  0.3935  0.9203  0.0413  0.6656  0.7489  0.0460  0.7157  0.2050


We apply the algorithm in (4), where $f(y(k)) = f_k(y(k))$ is adapted
according to our method described in (25)-(28) with $\eta(k) = 0.005$, $\delta = 0.01$,
$W(0) = I$, and

$$f_N(y) = |y|^{p}\,\mathrm{sgn}(y) \quad \text{and} \quad f_P(y) = \tanh(10 y), \qquad (29)$$


corresponding to nonlinearities for separating sub- and super-Gaussian-distributed
sources, respectively. Within each on-line nonlinearity selection procedure,
we set $\gamma(k)$ to one for all $k$. From the outputs of the separation
system, we compute the error vector $e(k) = [e_1(k) \cdots e_{10}(k)]^T$ as

$$e(k) = s(k) - D^{-1} P^T y(k), \qquad (30)$$

where approximate versions of the permutation matrix $P$ and scaling matrix
$D$ as introduced in (3) are obtained from $W(k)$ and $H$ at iteration $k = 10000$.
Figure 1 shows these ten error signals. Since each error signal decreases to a
small value after a sufficient number of iterations, all of the sources are reliably
extracted using our modified algorithm. Figure 2 plots the performance factor
$\psi(k)$ defined as


$$\psi(k) = \frac{\|P^T W(k) H\|_E^2}{\|[P^T W(k) H]_d\|_E^2} - 1, \qquad (31)$$


where $\|\cdot\|_E$ denotes the matrix Euclidean norm and where $[Q]_d$ is a diagonal
matrix whose $(i,i)$th entry is $q_{ii}$. As can be seen, the value of $\psi(k)$ decreases
to approximately 0.0168 in steady state, indicating that the system has
adequately separated the ten sources. Moreover, a careful examination of the
nonlinearities chosen for each extracted output indicates that the appropriate
stabilizing nonlinearity $f_N(y)$ or $f_P(y)$ is eventually selected for each
output signal.

We now combine the algorithm modification in (25)-(28) with the blind
deconvolution and source separation technique in (7) and apply the resulting
system to a three-source separation problem. In this case, the three sources
are chosen to be i.i.d. Laplacian-, uniform-, and binary-distributed,
respectively, and the convolutive mixture model is given by

$$x(k) = A_1 x(k-1) + A_2 x(k-2) + B_0 s(k) + B_1 s(k-1), \qquad (32)$$

where the $(3 \times 3)$ matrices $A_1$, $A_2$, $B_0$, and $B_1$ are given in (33) and (34).
The printed matrices survive only as the groups of entries below, listed in their
extracted order; the assignment of entries to $A_1$, $A_2$ (33) and $B_0$, $B_1$ (34) is
not recoverable from this layout:

-0.1, -0.2, -0.5, 0.02, 0, 0.07
0.3, 0.2, -0.2, 0.07, 0.09, 0.04
-0.1, -0.3, 0.4, 0.05, 0.08, 0
0.04, 0.02, 0.03
0.08, 0.04, 0
0.03, 0.06, 0.06
0.1, 0.5, 0.7
0, 0.4, 0.1
0.4, 0.7, 0.6

Figure 1: The ten error signals for the first source separation example.


Figures 3(a)-(c) and (d)-(f) show the vector sequences $s(k)$ and $x(k)$ in
this case. The deconvolution system with time-varying nonlinearities was
applied to these signals, in which $L = 6$, $\eta(k) = 0.0005$, $\delta = 0.001$,
$W_p(0) = I \delta(p - 3)$, and $\gamma(k)$ was set to one for all $k$ within the nonlinearity
selection procedures. Shown in Figures 3(g)-(i) are the error signals
$e_1(k) = s_1(k) - y_3(k)/d_{33}$, $e_2(k) = s_2(k) - y_2(k)/d_{22}$, and
$e_3(k) = s_3(k) - y_1(k)/d_{11}$, where $d_{11}$, $d_{22}$, and $d_{33}$ are appropriate scaling
factors. As can be seen, the errors decrease to small values for each extracted
output, and the signal-to-noise ratios for the three extracted outputs were
empirically found to be 10.0, 7.4, and 15.2 dB, respectively.

Figure 4 shows the actual value of $\gamma(k) - 1$ on a logarithmic scale for the
second source separation example. Starting from an initial value of $\gamma(1) = 2.94$,
the value of $\gamma(k)$ gradually approaches unity over time. These results,
combined with the successful separation capabilities of the modified systems,
indicate that setting $\gamma(k)$ to one within the nonlinearity selection procedures
does not limit the overall capabilities of the systems.




Figure  2:  Perfc 


ce  separation  example. 


Figure  3:  The  three  source  signals  ((a)-(c)),  the  three  mixed  signals  ((d)- 
(f)),  and  the  three  error  signals  ((g)-(i))  for  the  convolutive-mixture  source 
separation  example. 


Figure 4: Evolution of $\gamma(k) - 1$ for the convolutive-mixture source
separation example.


V.  CONCLUSIONS 


In conclusion, we have described techniques for selecting the nonlinearities
within blind source separation and deconvolution algorithms to enable
the separation of sources with arbitrary distributions. The proposed methods
can be easily implemented in an on-line setting. Simulations applying the
techniques to instantaneous mixture separation and to multichannel
deconvolution and source separation indicate the ability of the methods to
accurately separate signal mixtures containing both sub- and super-Gaussian
sources.


Abstract- The recurrent canonical piecewise linear (RCPL) network is defined
by combining the canonical piecewise linear function with the autoregressive
moving average (ARMA) model such that an augmented input space is
partitioned into regions, with an ARMA model used in each. Properties
of the RCPL network are discussed. In particular, it is shown that the RCPL
function is a contractive mapping and is stable in the sense of bounded-input
bounded-output stability. By generalizing Donoho's minimum entropy
deconvolution approach [5] to the nonlinear case, it is shown that the RCPL
network can achieve blind equalization. The RCPL network is applied to both
supervised and blind equalization, and results are presented to show that it
is computationally efficient and, with a very simple structure, can deliver
highly satisfactory performance.


I. INTRODUCTION

Nonlinear techniques offer great potential to improve on existing methods
based on the assumption of linearity and, in certain cases, to deal with
problems for which linear techniques have proven ineffective. Among the
approaches proposed for nonlinear signal processing, polynomial or Volterra
filters [12] can yield a small asymptotic probability of error if sufficiently
high-order polynomials are used, but they also, in general, converge very
slowly. Neural network structures such as the multilayer perceptron [6] and
radial basis functions [3] have been proposed as alternatives for the
nonlinear approximation problem, but in communications applications such
as equalization, which require real-time processing capability, their
relatively slow convergence has been the main concern. The attractiveness
of piecewise linear models, on the other hand, stems from the
fact that while they allow the use of a variety of linear analysis and
development tools, they are also good approximators of highly nonlinear
functions. They have been used effectively in control engineering [13],
nonlinear circuit analysis [9], and channel equalization [2]. A special class
of piecewise linear structures, canonical piecewise linear (CPL) models, employ
a global linear model in a partitioned domain space rather than using
individual linear models in each partition as does the piecewise linear model.
Hence they greatly reduce the parameter storage requirement of the piecewise
linear model and the gain scheduling model [14], one of the major problems in
the implementation of piecewise linear filters. We can use standard linear
adaptive filtering techniques to perform training tasks on a piecewise linear
filter and can easily incorporate known statistical information into the
network structure.

We have introduced the recurrent CPL (RCPL) function by combining the
CPL function with the autoregressive moving average (ARMA) model [10].
Hence, RCPL mapping is the process of finding partition boundaries of a
sample space in an augmented domain space where an ARMA model is used for
approximation in each partitioned region. The proposed RCPL network offers
the advantages of both the CPL and ARMA models. Specifically, the RCPL
network offers the following benefits: (1) it makes use of standard linear
adaptive filtering techniques to perform training tasks and allows for
efficient selection of the partition boundaries; (2) it offers savings in
computation time and implementation cost, especially when modeling highly
nonlinear functions; (3) because of its piecewise linear nature, it is easy to
incorporate known statistical information into the network structure; (4)
since the RCPL network also employs feedback, it has a distinct dynamic
behavior which is much more powerful than that attained by the use of
finite-duration impulse response feedforward structures. In this paper, we
first review the definition and approximation ability of the RCPL network.
Then, we discuss the properties of the RCPL network. Finally, we consider the
application of the RCPL network to both supervised and blind equalization.

II. RECURRENT PIECEWISE LINEAR NETWORK

Before giving the formal definition of a recurrent canonical piecewise linear
representation, we first present the definition of a canonical piecewise
linear (CPL) function. The CPL network was initially introduced for nonlinear
circuit analysis [4]. CPL structures provide a desirable compromise between
the approximation ability of nonlinear models and the efficiency and
theoretical accessibility of the linear domain, and reduce the parameter
storage requirement of piecewise linear models considerably by employing a
global linear representation.

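The CPL function is defined as [4]:

Definition 1 (Canonical Piecewise Linear Function): A piecewise linear
function $f: D \to Q$, with a compact subset $D \subset \mathbb{R}^N$ and a
compact subset $Q \subset \mathbb{R}^M$, is called a canonical piecewise
linear function if it can be expressed by a global representation:

$$f(\mathbf{x}) = \mathbf{a} + \mathbf{B}\mathbf{x} + \sum_{i=1}^{r} \mathbf{c}_i \left| \langle \boldsymbol{\alpha}_i, \mathbf{x} \rangle + \beta_i \right| \tag{1}$$

where $\mathbf{B} \in \mathbb{R}^{M \times N}$, $\mathbf{a}, \mathbf{c}_i \in \mathbb{R}^M$, $\boldsymbol{\alpha}_i, \mathbf{x} \in \mathbb{R}^N$, and $\beta_i \in \mathbb{R}$.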
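The global representation in (1) can be evaluated directly, as in the
following minimal sketch. The function name `cpl`, the list-based data layout,
and the scalar example parameters are assumptions made for illustration and do
not come from [4].

```python
def cpl(x, a, B, c, alpha, beta):
    """Evaluate f(x) = a + B x + sum_i c_i * |<alpha_i, x> + beta_i|.

    x      : input vector, length N
    a      : offset vector, length M
    B      : M x N matrix (list of rows)
    c      : list of r vectors, each of length M
    alpha  : list of r vectors, each of length N
    beta   : list of r scalars
    """
    # Global linear part a + B x.
    out = [a_m + sum(b * x_n for b, x_n in zip(row, x))
           for a_m, row in zip(a, B)]
    # Each absolute-value term adds one hyperplane boundary to the partition.
    for c_i, alpha_i, beta_i in zip(c, alpha, beta):
        hinge = abs(sum(p * x_n for p, x_n in zip(alpha_i, x)) + beta_i)
        out = [o + c_m * hinge for o, c_m in zip(out, c_i)]
    return out


# Hypothetical scalar example (N = M = 1, r = 1): f(x) = 0.5 x + |x - 1|.
params = dict(a=[0.0], B=[[0.5]], c=[[1.0]], alpha=[[1.0]], beta=[-1.0])
left = cpl([0.0], **params)    # region x < 1, where the slope is 0.5 - 1
right = cpl([3.0], **params)   # region x > 1, where the slope is 0.5 + 1
knee = cpl([1.0], **params)    # on the boundary <alpha, x> + beta = 0
```

Only one set of parameters is stored, yet the function behaves as a different
affine map on each side of the boundary, which is the storage saving the text
attributes to the canonical form.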
On the other hand, a very general class of linear models used for prediction
of a random process $x(n)$ is the class of ARMA models, and a natural
generalization of the linear ARMA model is the nonlinear autoregressive
moving average (NARMA) model

$$x(n) = h\bigl(x(n-1), x(n-2), \ldots, x(n-p_1), e(n-1), e(n-2), \ldots, e(n-p_2)\bigr)$$

where $h$ is a nonlinear function and $e(n)$ is the input variable. Next, we
introduce an RCPL function as an extension of the CPL function by
incorporating the NARMA model.

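The RCPL function is defined as [10]:

Definition 2 (Recurrent Canonical Piecewise-Linear Function): A function
$f: D_1 \times D_2 \times I \to Q$ with sample spaces $D_1 \subset \mathbb{R}^N$
and $D_2$, index set $I$, and compact subset $Q \subset \mathbb{R}^M$ is said
to be an RCPL function if it can be expressed by the global representation:

$$f(\mathbf{x}(n), \mathbf{u}(n)) = \mathbf{a} + \mathbf{B}_0 \mathbf{x}(n) + \mathbf{B}_1 f(\mathbf{x}(n-1), \mathbf{u}(n-1)) + \mathbf{B}_2 \mathbf{u}(n) \tag{2}$$

$$x_k(n) = a_k + \mathbf{b}_{1k}^T \mathbf{x}(n-1) + \mathbf{b}_{2k}^T f(\mathbf{x}(n-1), \mathbf{u}(n-1)) + \mathbf{b}_{3k}^T \mathbf{u}(n) + \sum_{i=1}^{r} c_{ki} \bigl| \langle \boldsymbol{\alpha}_{1ki}, \mathbf{x}(n-1) \rangle + \langle \boldsymbol{\alpha}_{2ki}, f(\mathbf{x}(n-1), \mathbf{u}(n-1)) \rangle + \langle \boldsymbol{\alpha}_{3ki}, \mathbf{u}(n) \rangle + \beta_{ki} \bigr| \tag{3}$$

where $\mathbf{x}, \mathbf{b}_{1k}, \boldsymbol{\alpha}_{1ki} \in \mathbb{R}^N$;
$\mathbf{u}$, $\mathbf{b}_{3k}$, and $\boldsymbol{\alpha}_{3ki}$ lie in the
input space containing $D_2$; $\mathbf{a}, \mathbf{b}_{2k}, \boldsymbol{\alpha}_{2ki} \in \mathbb{R}^M$;
$\mathbf{B}_0$, $\mathbf{B}_1$, and $\mathbf{B}_2$ are matrices of conformable
dimensions; $a_k, c_{ki}, \beta_{ki} \in \mathbb{R}$; $k = 1, 2, \ldots, N$;
and $x_k$ is the $k$th element of the vector $\mathbf{x}$. We refer to the
structure defined by (2) and (3) as the recurrent canonical piecewise linear
network.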

From the definition, we see that the domain of the RCPL function is
partitioned into polyhedral regions where the function defines a linear ARMA
model in each.

III. PROPERTIES OF RECURRENT PIECEWISE LINEAR NETWORK

Based on Definition 1, we show the approximation ability of the CPL network by
the following theorem:

Theorem 1: Let the domain $D$ be a compact space of dimension $N$ and let
$\mathcal{F}$ be the set of canonical piecewise linear functions on $D$. Then,
for any continuous function $f$ on $D$ and any $\epsilon > 0$, there exists a
function $\hat{f} \in \mathcal{F}$ such that
$|f(\mathbf{x}) - \hat{f}(\mathbf{x})| < \epsilon$ for all $\mathbf{x} \in D$.


The proof of the theorem is given in [1]. By comparing Definitions 1 and
2, we can see that the RCPL filter is actually a special case of the CPL
filter. The RCPL filter partitions the input signal space into finitely many
disjoint regions, and in each region it can be represented by an FIR filter
with infinite length. Therefore, the result presented in Theorem 1 also holds
for the RCPL filter.

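We can rewrite the RCPL network described by (2) and (3) as follows:

$$\tilde{\mathbf{x}}(n) = \mathbf{a} + \mathbf{B}_1 \tilde{\mathbf{x}}(n-1) + \mathbf{B}_2 \mathbf{u}(n) + \sum_{i=1}^{r} \mathbf{c}_i \bigl| \langle \tilde{\boldsymbol{\alpha}}_{1i}, \tilde{\mathbf{x}}(n-1) \rangle + \langle \tilde{\boldsymbol{\alpha}}_{2i}, \mathbf{u}(n) \rangle + \beta_i \bigr| \tag{4}$$

where $\tilde{\mathbf{x}}(n) = (\mathbf{x}(n)^T, f(\mathbf{x}(n))^T)^T$. Using
this representation for RCPL, we prove the following: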

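Theorem 2: For the RCPL network defined by (2) and (3), assume that
the input vector $\mathbf{u}(n)$ is bounded and that the parameters satisfy
the following condition: there exists an $\epsilon_0 \in (0, 1)$ such that

$$\|\mathbf{B}_1\| + \sum_{i=1}^{r} \|\mathbf{c}_i\| \, \|\tilde{\boldsymbol{\alpha}}_{1i}\| \le 1 - \epsilon_0. \tag{5}$$

Then there is a real number $d$ such that, for all $K > d$, the ball
$D(K) = \{\tilde{\mathbf{x}} : \|\tilde{\mathbf{x}}\| < K\}$ is invariant
under (2) and (3).

Proof: Define $F(\tilde{\mathbf{x}}(n-1)) = \tilde{\mathbf{x}}(n)$. From (4),
we have

$$F(\tilde{\mathbf{x}}(n-1)) = \mathbf{a} + \mathbf{B}_1 \tilde{\mathbf{x}}(n-1) + \mathbf{B}_2 \mathbf{u}(n) + \sum_{i=1}^{r} \mathbf{c}_i \bigl| \langle \tilde{\boldsymbol{\alpha}}_{1i}, \tilde{\mathbf{x}}(n-1) \rangle + \langle \tilde{\boldsymbol{\alpha}}_{2i}, \mathbf{u}(n) \rangle + \beta_i \bigr|.$$

Since $\mathbf{u}(n)$ is bounded, there exists a real number $r_0$ such that
for $\|\tilde{\mathbf{x}}(n-1)\| > r_0$, we have

$$\frac{\|\mathbf{a}\| + \|\mathbf{B}_2\| \, \|\mathbf{u}(n)\| + \sum_{i=1}^{r} \|\mathbf{c}_i\| \bigl( \|\tilde{\boldsymbol{\alpha}}_{2i}\| \, \|\mathbf{u}(n)\| + |\beta_i| \bigr)}{\|\tilde{\mathbf{x}}(n-1)\|} < \epsilon_0. \tag{6}$$

Moreover, by the triangle inequality,

$$\Bigl\| \mathbf{a} + \mathbf{B}_1 \tilde{\mathbf{x}}(n-1) + \mathbf{B}_2 \mathbf{u}(n) + \sum_{i=1}^{r} \mathbf{c}_i \bigl| \langle \tilde{\boldsymbol{\alpha}}_{1i}, \tilde{\mathbf{x}}(n-1) \rangle + \langle \tilde{\boldsymbol{\alpha}}_{2i}, \mathbf{u}(n) \rangle + \beta_i \bigr| \Bigr\|$$
$$\le \|\mathbf{a}\| + \Bigl( \|\mathbf{B}_1\| + \sum_{i=1}^{r} \|\mathbf{c}_i\| \, \|\tilde{\boldsymbol{\alpha}}_{1i}\| \Bigr) \|\tilde{\mathbf{x}}(n-1)\| + \|\mathbf{B}_2\| \, \|\mathbf{u}(n)\| + \sum_{i=1}^{r} \|\mathbf{c}_i\| \bigl( \|\tilde{\boldsymbol{\alpha}}_{2i}\| \, \|\mathbf{u}(n)\| + |\beta_i| \bigr).$$

Using equations (5) and (6), we have

$$\|F(\tilde{\mathbf{x}}(n-1))\| < (1 - \epsilon_0) \|\tilde{\mathbf{x}}(n-1)\| + \epsilon_0 \|\tilde{\mathbf{x}}(n-1)\| = \|\tilde{\mathbf{x}}(n-1)\| \quad \text{for } \|\tilde{\mathbf{x}}(n-1)\| > r_0. \tag{7}$$

On the other hand, because $F(\tilde{\mathbf{x}}(n-1))$ is continuous, there
exists a constant $K_0$ such that

$$\|F(\tilde{\mathbf{x}}(n-1))\| \le K_0 \quad \text{for } \|\tilde{\mathbf{x}}(n-1)\| \le r_0.$$

Let $d = \max\{r_0, K_0\}$. Then, for every real number $K > d$, we have

$$\|F(\tilde{\mathbf{x}}(n-1))\| \le K_0 < K \quad \text{for every } \|\tilde{\mathbf{x}}(n-1)\| \le r_0$$

and, by using (7),

$$\|F(\tilde{\mathbf{x}}(n-1))\| < \|\tilde{\mathbf{x}}(n-1)\| < K \quad \text{for every } r_0 < \|\tilde{\mathbf{x}}(n-1)\| < K.$$

Hence, $D(K)$ is invariant.

The above theorem states that the output of the RCPL network described by
(2) and (3) is bounded for a bounded input if condition (5) is satisfied.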

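Theorem 3: The map which defines the RCPL network (2) and (3) is a
contractive mapping if the condition given in (5) is satisfied.

Proof: Let

$$k(\tilde{\mathbf{x}}) = \mathbf{a} + \mathbf{B}_1 \tilde{\mathbf{x}} + \mathbf{B}_2 \mathbf{u}(n) + \sum_{i=1}^{r} \mathbf{c}_i \bigl| \langle \tilde{\boldsymbol{\alpha}}_{1i}, \tilde{\mathbf{x}} \rangle + \langle \tilde{\boldsymbol{\alpha}}_{2i}, \mathbf{u}(n) \rangle + \beta_i \bigr|;$$

then

$$k(\tilde{\mathbf{x}}_1) - k(\tilde{\mathbf{x}}_2) = \mathbf{B}_1 (\tilde{\mathbf{x}}_1 - \tilde{\mathbf{x}}_2) + \sum_{i=1}^{r} \mathbf{c}_i \Bigl( \bigl| \langle \tilde{\boldsymbol{\alpha}}_{1i}, \tilde{\mathbf{x}}_1 \rangle + \langle \tilde{\boldsymbol{\alpha}}_{2i}, \mathbf{u}(n) \rangle + \beta_i \bigr| - \bigl| \langle \tilde{\boldsymbol{\alpha}}_{1i}, \tilde{\mathbf{x}}_2 \rangle + \langle \tilde{\boldsymbol{\alpha}}_{2i}, \mathbf{u}(n) \rangle + \beta_i \bigr| \Bigr)$$

and

$$\|k(\tilde{\mathbf{x}}_1) - k(\tilde{\mathbf{x}}_2)\| \le \Bigl( \|\mathbf{B}_1\| + \sum_{i=1}^{r} \|\mathbf{c}_i\| \, \|\tilde{\boldsymbol{\alpha}}_{1i}\| \Bigr) \|\tilde{\mathbf{x}}_1 - \tilde{\mathbf{x}}_2\| \le (1 - \epsilon_0) \|\tilde{\mathbf{x}}_1 - \tilde{\mathbf{x}}_2\|,$$

which shows that $k(\cdot)$ is a contractive mapping whenever (5) is satisfied.
Thus, after receiving the input vector $\mathbf{u}(n)$, the network will
always reach a unique equilibrium regardless of its initial state
$\mathbf{x}_0$.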
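The contraction property can be checked numerically on a scalar recursion
that has the same form as (4). The sketch below is only an illustration under
assumed values: every parameter is hypothetical and was picked so that the
scalar analogue of condition (5) holds with 1 - eps0 = 0.7.

```python
def rcpl_step(x_prev, u, a=0.1, b1=0.3, b2=0.5, c=0.5,
              alpha1=0.8, alpha2=0.2, beta=-0.4):
    """One step of a scalar recursion with the form of (4).

    All parameter values are hypothetical; they satisfy
    |b1| + |c| * |alpha1| = 0.7 <= 1 - eps0 with eps0 = 0.3.
    """
    return (a + b1 * x_prev + b2 * u
            + c * abs(alpha1 * x_prev + alpha2 * u + beta))


def run(x0, inputs):
    """Iterate the recursion from initial state x0 over an input sequence."""
    states = [x0]
    for u in inputs:
        states.append(rcpl_step(states[-1], u))
    return states


inputs = [1.0] * 60                 # bounded (here constant) input
traj_a = run(10.0, inputs)
traj_b = run(-10.0, inputs)

# Gap between the two trajectories at every time step.
gaps = [abs(p - q) for p, q in zip(traj_a, traj_b)]
```

Under these assumptions the gap shrinks by a factor of at most 0.7 per step,
so two widely separated initial states end up at the same equilibrium, in line
with the statement of Theorem 3.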

Theorems 1-3 state that under certain regularity conditions, the RCPL
network is always stable in the sense of bounded-input bounded-output
stability, and can approximate any continuous nonlinear function with
arbitrary accuracy.

IV. BLIND EQUALIZATION BY RCPL NETWORK

Blind equalization refers to the problem of determining a transmitted symbol
sequence in the presence of intersymbol interference (ISI) and noise without
using a training sequence. Most existing blind techniques, such as the
Bussgang algorithm, the cyclic spectrum approach, and the polyspectra
approach, are based on the linear channel assumption. There are many cases,
however, where this assumption is not true, as nonlinear devices significantly
contribute to system degradation. One example is the digital satellite link,
in which both the earth station and the satellite are equipped with amplifiers
operated in a nonlinear region of the input-output characteristics for better
exploitation of the power of the device. The above blind techniques suffer
from severe performance degradation for unknown nonlinear communication
channels, and hence it is very important to develop nonlinear blind
equalization techniques. We show that nonlinear blind equalization can be
achieved by matching the distribution of the channel input with that of the
RCPL equalizer output.

To show the ability of the RCPL network to achieve blind equalization, we
first discuss some results based on the CPL network. We introduce the
following: The nonlinear channel $h(\cdot)$ maps the input sequence
$x(n) \in \mathbb{R}$ to $y(n) = h(x(n), x(n-1), \ldots, x(n-l-p))$, and the
CPL equalizer aims to recover the input sequence by constructing a mapping
$h_{eq}: D \to Q$, where $D \subset \mathbb{R}^N$ and $Q \subset \mathbb{R}$.
Assume that the global system, the cascade of the nonlinear channel
$h(\cdot)$ and the CPL equalizer $h_{eq}(\cdot)$, is denoted by $T$ and is
modeled by a CPL network which divides the input space into $m$ disjoint
regions $R_1, R_2, \ldots, R_m$, and that in each region $R_i$ the CPL function
given in (1) is equivalent to the following linear model:

$$M_i: \quad \hat{x}(n) = \sum_{j} w_{ij} x_j(n) \tag{8}$$

where $x_j(n) = x(n - j + 1)$ and $\hat{x}(n)$ is the output of the equalizer.

We  then  make  the  following  assumptions: 

(i) The input sequence $\{x(n)\}$ is an i.i.d. random process.

(ii) The distribution of $x(n)$ is symmetric about zero with finite variance.

(iii) The mapping $M_i$, $i = 1, 2, \ldots, m$, is a one-to-one mapping, and
$I_i \cap I_j = \emptyset$.

We  then  prove  the  following: 

Theorem 4: Consider the global system $T$ defined by (8), and suppose that
assumptions (i)-(iii) are satisfied. If the distribution of $\{\hat{x}(n)\}$
is the same as that of $x(n)$, then the global system $T$ is the identity
except for a possible delay and a sign factor.

The proof of Theorem 4 is given in [1]. As we explained in Section III, the
RCPL filter is a special case of the CPL filter. Therefore, the result
presented in Theorem 4 is also true for the RCPL filter. Hence, any nonlinear
channel can be represented as an RCPL function, and furthermore, if we use an
RCPL network as an equalizer, then the global system, which consists of the
cascade of the channel and the equalizer, is still an RCPL function. Thus, for
blind equalization, we can update the weights of the RCPL equalizer such that
the instantaneous distribution of the output $\hat{x}(n)$ of the equalizer
converges to the input distribution u.

V.  EXAMPLES 

1.  Blind  Equalization 

Several cost functions, such as the moment error cost function [8], the
Godard/Sato cost function [7], Vembu-Verdu's convex cost function [15], and
the partial likelihood cost function [1], can be used for distribution
matching. Because the Godard/Sato cost function is not convex, the derived
blind algorithm may only find a local minimum. If proper initial weights are
chosen, however, the algorithm can find the global minimum, and the equalizer
prediction error tends to zero. Thus, the resulting blind equalizer has a
larger stability margin. Vembu-Verdu's convex cost function can help the
algorithm find the global minimum for a linear channel, and the equalizer
prediction error can quickly enter the decision boundary; however, it results
in a larger residual prediction error after convergence. In [11], a new
equalizer structure is proposed by incorporating the RCPL network with
decision feedback, and a blind algorithm is presented based on both the
Godard [7] and Vembu-Verdu [15] cost functions for the RCPL network based
equalizer. We first use Vembu-Verdu's cost function in the learning process
and then switch to the Godard cost function when the absolute gradient of the
Godard error changes very slowly. As an example, a nonlinear communication
channel $g(n)$ whose nonminimum-phase multipath component is given by
$g_1(n) = 0.9\,x(n) + x(n-1)$ is considered in [11]. The input $x(n)$ takes
values from the binary set $S = \{-1, 1\}$ and has a symmetric distribution.
The simulation results given in [11] indicate that the blind algorithm derived
from the combined Godard and Vembu-Verdu cost functions exhibits a good
tradeoff between robustness and low equalizer prediction error. The developed
blind algorithm is much faster than that derived from the Godard cost function
or the Vembu-Verdu cost function alone. Here, we only show the bit error
curves of the three cost functions in Figure 1. The RCPL network based
decision feedback equalizer outperforms the linear decision feedback equalizer
and the CMA equalizer when equalizing a nonlinear channel.
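To make the Godard criterion mentioned above concrete, the following sketch
adapts a plain linear FIR equalizer by stochastic gradient descent on the
Godard (p = 2) cost, which is the constant modulus algorithm (CMA) used as a
baseline above. It is not the RCPL decision-feedback structure of [11]: the
linear minimum-phase channel, the tap count, the step size, and the function
names are all assumptions chosen for a short, reproducible illustration. For
the binary alphabet {-1, 1} the dispersion constant is taken as 1.

```python
import random


def godard_cost(y, r2=1.0):
    """Godard (p = 2) dispersion cost for one real-valued output sample."""
    return (y * y - r2) ** 2


def cma_equalize(received, num_taps=5, step=0.01, r2=1.0):
    """Adapt a linear FIR equalizer by stochastic gradient descent on the
    Godard cost; returns the final taps and the per-sample cost history."""
    taps = [0.0] * num_taps
    taps[0] = 1.0                      # spike initialization
    window = [0.0] * num_taps          # most recent received samples
    costs = []
    for sample in received:
        window = [sample] + window[:-1]
        y = sum(w * v for w, v in zip(taps, window))
        costs.append(godard_cost(y, r2))
        # Gradient of (y^2 - r2)^2 w.r.t. the taps is 4 (y^2 - r2) y * window;
        # the constant 4 is absorbed into the step size.
        err = (y * y - r2) * y
        taps = [w - step * err * v for w, v in zip(taps, window)]
    return taps, costs


random.seed(0)
symbols = [random.choice((-1.0, 1.0)) for _ in range(5000)]
# Hypothetical linear, minimum-phase channel: y(n) = x(n) + 0.3 x(n-1).
received = [s + 0.3 * p for s, p in zip(symbols, [0.0] + symbols[:-1])]
taps, costs = cma_equalize(received)
early = sum(costs[:200]) / 200      # average cost while still adapting
late = sum(costs[-200:]) / 200      # average cost after adaptation
```

No training sequence is used anywhere in the update, only the known modulus of
the transmitted symbols, which is what makes the criterion blind. The spike
initialization reflects the remark above that the non-convex Godard cost needs
proper initial weights.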

2.  Adaptive  Equalization 

A simple RCPL network and the corresponding learning algorithm are given in
[10]. Assume that $y(n)$ is the channel output corresponding to the
transmitted signal $x(n)$ at time instant $n$. Let
$\mathbf{y}(n) = [y(n), \ldots, y(n - M_1 + 1)]^T$ be the input vector of the
RCPL network and $\hat{x}(n)$ be the output of the network trained to
approximate $x(n)$. We choose the nonlinear function $f_k(\cdot)$ of the RCPL
filter shown in Figure 2 as

$$f_k(z_k(n)) = |z_k(n) + 1| - |z_k(n) - 1|, \quad k = 1, 2, \ldots, M_1$$

to equalize the following channel:

y(n)  =  yi(n)~\-0.Q2yi{n)'^ -\-7j{n)  (9) 

where the multipath component is given by $y_1(n) = x(n) + 0.5\,x(n-1) +$
^A$x(n-2)$, $x(n)$ is the input signal, $y(n)$ is the channel output, and
$\eta(n)$ denotes the zero-mean, white noise component.
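The nonlinear function chosen above, $f(z) = |z + 1| - |z - 1|$, is a
saturating piecewise linear map, as the short sketch below shows. The function
name `saturating_pwl` and the sample points are assumptions introduced only
for this illustration.

```python
def saturating_pwl(z):
    """f(z) = |z + 1| - |z - 1|: slope 2 on [-1, 1], clipped to +/-2 outside."""
    return abs(z + 1.0) - abs(z - 1.0)


samples = [-3.0, -1.0, -0.25, 0.0, 0.5, 1.0, 3.0]
values = [saturating_pwl(z) for z in samples]
```

The two absolute-value terms place breakpoints at z = -1 and z = 1, so the
node acts like a hard-limited linear unit and needs no exponentials or
divisions, unlike a hyperbolic tangent.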

Figures 3-5 compare the performance of the RCPL equalizer with that of
the multilayer perceptron (MLP) equalizer and the recurrent neural network
(RNN) equalizer for 16-PAM, 8-PAM and 2-PAM signal transmission over the
multipath channel of (9). Here, the RCPL equalizer has only 3 nodes (M = 3)
and the MLP equalizer has 2 hidden layers with 11 nodes in each layer. The RNN
equalizer has the same number of nodes as the RCPL filter. Its activation
function is a hyperbolic tangent function. The RNN equalizer is modified
such that it has the same structure as the RCPL filter, that is, no activation
function is used in its output layer and no delay is employed between the
most recent output of the channel and the output of the equalizer. This
modified RNN structure gives much better results than the standard RNN
structure given in [8]. The variance of the noise is 0.01 in all the
simulations. As seen in this example, the performance of the RCPL equalizer,
which has a piecewise linear activation function, is far superior to that of
the MLP equalizer, and it is comparable to that of the modified RNN equalizer,
which uses a hyperbolic tangent activation function. We also compare the
performance of the RCPL equalizer for different learning rates in Figure 6.
The simulation results show that the learning rate $\alpha_1$, which
corresponds to the linear part of the RCPL equalizer (i.e., the filter with
index 0 in Figure 2), plays an important role in the learning process in that
it controls the rate of convergence. The choice of the overall learning rate
$\alpha_2$, which corresponds to the nonlinear part (i.e., the filters with
indices $1, \ldots, M$ in Figure 2), is more flexible than that of $\alpha_1$,
since the choice of the order M is more important than that of the learning
rate for this part.

Figure 1: BER comparison

Figure 2: A simple RCPL network

Figure 3: 16-PAM equalization

Figure 4: 8-PAM equalization

Figure 5: 2-PAM equalization

Figure 6: Learning rate comparison
Abstract

We derive new unsupervised learning rules for blind separation
of mixed and convolved sources. These rules are nonlinear
in the signals and thus exploit high-order spatiotemporal
statistics to achieve separation. The derivation is based on
a global optimization formulation of the separation problem,
yielding a stable algorithm. Different rules are obtained from
frequency- and time-domain optimization. We illustrate the
performance of this method by successfully separating convolutive
mixtures of speech signals.


1  INTRODUCTION 

In the problem of linear square blind separation [1], one considers L
independent signal sources $x_i(t)$ (e.g., different speakers in a room) and
L sensors $y_i(t)$ (e.g., microphones at several locations). Each sensor
receives a mixture of the source signals. The task is to recover the original
sources from the observed sensor signals. The separation is termed blind
because it must be achieved without any information about the sources, apart
from their statistical independence.

Blind separation algorithms can have many applications in areas involving
processing of multi-sensor signals, such as speech enhancement (the 'cocktail
party' problem) and the analysis and interpretation of biomedical signals
(e.g., EKG, EEG [8]). Most of the separation methods that have been proposed
aim at a simplified version of the problem where the mixing process
is linear and instantaneous (memoryless). In that case we seek a separating
transformation $g_{ij}$ that, when applied to the sensor signals $y_i(t)$,
will recover the sources, possibly scaled and permuted:
$\hat{x}_i(t) = \sum_j g_{ij} y_j(t)$. In particular, independent component
analysis (ICA) algorithms [2-7] can identify $g_{ij}$ fast and efficiently in
many cases.

However, the mixing in realistic situations is not memoryless, due to
multipath propagation and the impulse response of the medium and of the
sensors. The resulting 'convolutive' mixtures cannot be separated by ICA
methods. In this paper we present a novel unsupervised learning algorithm for
blind separation of linear, time-invariant mixtures with memory, termed
dynamic component analysis (DCA). The separation in this case requires a
transformation with a dynamic impulse response $g_{ij}(t)$ (a matrix of
filters),

$$\hat{x}_i(t) = \sum_{j=1}^{L} \int_0^{\infty} dt' \, g_{ij}(t') \, y_j(t - t'), \tag{1}$$

where $\hat{x}_i(t)$ are the recovered source signals. More generally, the
signals $y_i(t)$ may be taken from any temporal multi-sensor data set; the new
signals $\hat{x}_i(t)$ are termed the dynamic components (DC) of those data.

Like the original sources, the DC's are characterized by their statistical
independence, and consequently by the property that their joint moments
factorize. In the time domain, this implies
$\langle \hat{x}_i(t)^m \hat{x}_j(t+\tau)^n \rangle = \langle \hat{x}_i(t)^m \rangle \langle \hat{x}_j(t+\tau)^n \rangle$
for $i \ne j$ and all orders $m, n$ at any time lag $\tau$; the
average is taken over time $t$. Note that in contrast, the independent
components found by ICA algorithms satisfy this property only for $\tau = 0$.
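The lagged moment factorization above can be checked empirically with time
averages. The sketch below is an illustration under assumptions: the uniform
test signals, the orders m = n = 2, the lag of 3 samples, and the helper names
are all hypothetical choices, and a delayed copy of one signal stands in for a
pair of components that are not independent across time lags.

```python
import random


def mean(values):
    return sum(values) / len(values)


def factorization_gap(x, y, m, n, lag):
    """|<x(t)^m y(t+lag)^n> - <x(t)^m><y(t+lag)^n>| with time averages."""
    xs = x[:len(x) - lag]
    ys = y[lag:]
    joint = mean([a ** m * b ** n for a, b in zip(xs, ys)])
    return abs(joint - mean([a ** m for a in xs]) * mean([b ** n for b in ys]))


random.seed(1)
N = 20000
x1 = [random.uniform(-1.0, 1.0) for _ in range(N)]
x2 = [random.uniform(-1.0, 1.0) for _ in range(N)]   # independent of x1
x3 = [0.0] * 3 + x1[:-3]                              # x1 delayed by 3 samples

gap_independent = factorization_gap(x1, x2, m=2, n=2, lag=3)
gap_dependent = factorization_gap(x1, x3, m=2, n=2, lag=3)
```

For the independent pair the gap is at the level of sampling noise, while for
the delayed copy it is large at exactly the lag that aligns the two signals. A
test restricted to zero lag would not see that dependence, which is the
distinction drawn above between dynamic components and ICA components.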

In order to find the separating transformation $g_{ij}(t)$, one could impose
the joint moment factorization as a condition on the resulting signals
$\hat{x}_i(t)$. Rather than imposing it explicitly, which can practically be
done only for low-order moments [14], an effective way to impose this
condition implicitly and to all orders is to formulate the separation task as
an optimization problem via the use of a latent-variable model [10].
Specifically, we construct a model for the joint distribution of the sensor
signals over N-point time blocks,
$p_y[\mathbf{y}(t_0), \ldots, \mathbf{y}(t_{N-1})]$, parametrized by the
separating filter matrix $\mathbf{g}(t_0), \ldots, \mathbf{g}(t_{M-1})$. Next,
we define the 'distance' between our model sensor distribution and the
observed distribution using the Kullback-Leibler (KL) distance [11], an
information theory-based measure for the distance between two distributions.
The model parameters are then optimized to minimize this distance by the
stochastic gradient descent method, yielding the DCA learning rules for
$g_{ij}(t)$.

This  global  optimization  formulation  of  the  problem  can  be  given  in  either 
the  frequency  domain  or  the  time  domain.  Section  2  presents  the  frequency- 
domain  formulation  and  the  associated  learning  rules,  whereas  the  time- 
domain  version  is  given  in  Section  3.  The  performance  of  DCA  is  illustrated 
in  Section  4  by  successfully  separating  convolutive  mixtures  of  speech  signals. 

Notation: we work in discrete time $t_n$. Lower-case symbols are used for
time-domain quantities and upper-case symbols for their frequency-domain
counterparts. We use subscripts to refer to discrete times and frequencies,
e.g., $x_n = x(t_n)$ and $X_k = X(\omega_k)$. Vectors and matrices are
boldfaced.

2  FREQUENCY-DOMAIN  OPTIMIZATION 

Let $\mathbf{x}_n$ be the L-dimensional model source vector, whose elements
$x_{i,n} = x_i(t_n)$ are the source activities at time $t_n$; these are the
latent variables. Let $\mathbf{y}_n$ be the L-dimensional model sensor vector.
We work with N-point time blocks $\{t_n\}$, $n = 0, \ldots, N-1$. The two are
related by

$$\mathbf{x}_n = \sum_{m=0}^{M-1} \mathbf{g}_m \mathbf{y}_{n-m}, \qquad \mathbf{X}_k = \mathbf{G}_k \mathbf{Y}_k, \tag{2}$$

where the separating transformation $\mathbf{g}_m$ is a matrix of filters of
length $M \le N$, and $\mathbf{G}_k = \mathbf{G}(\omega_k)$ is its N-point
DFT. We focus first on the frequency-domain formulation (r.h.s. of (2)) where
the separation problem factorizes.
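The time-domain relation on the left of (2) is a multichannel FIR filtering
operation, which can be sketched as follows. The function name `separate`, the
zero initial conditions, and the two-channel echo example are assumptions made
for illustration; in particular, the filter values are hand-picked rather than
learned.

```python
def separate(sensors, filters):
    """Apply x_n = sum_{m=0}^{M-1} g_m y_{n-m} to a block of sensor vectors.

    sensors : list of N sensor vectors y_n (each a list of length L)
    filters : list of M matrices g_m (each L x L, as a list of rows)
    Samples before the start of the block are treated as zero.
    """
    L = len(sensors[0])
    outputs = []
    for n in range(len(sensors)):
        x_n = [0.0] * L
        for m, g_m in enumerate(filters):
            if n - m < 0:
                break
            y = sensors[n - m]
            for i in range(L):
                x_n[i] += sum(g_m[i][j] * y[j] for j in range(L))
        outputs.append(x_n)
    return outputs


# Hypothetical two-channel example: sensor 2 carries an echo of sensor 1,
# delayed by one sample and scaled by 0.5; g_1 cancels that echo.
g = [
    [[1.0, 0.0], [0.0, 1.0]],     # g_0
    [[0.0, 0.0], [-0.5, 0.0]],    # g_1
]
y = [[1.0, 0.0], [2.0, 0.5], [0.0, 1.0]]
x = separate(y, g)
```

In this toy case the second output is identically zero because the second
sensor contained nothing but the echo. An instantaneous matrix (M = 1) could
not achieve that, since the echo arrives one sample late.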

To  construct  a  model  sensor  distribution  py({YA;})  we  must  start  with 
a  model  source  distribution  px({Xfc}).  We  use  a  factorial  frequency-domain 
model, 

L  N/2-l 

PAr({X*})  =  n  n  PiAXi,k),  (3) 

i=l  *=1 

where  Pi^k  is  the  joint  distribution  of  ReXj^fe ,  ImXj^fe .  From  (2)  we  obtain 
Py  =  det(GjfcGj)px,  which  depends  on  the  separating  parameters  gm 
and  the  parameters  used  to  describe  Pi^k  (see  below). 
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Denoting  the  observed  sensor  signals  by  Y*,  we  now  define  a  distance 
measure  D  between  their  joint  distribution  py  and  our  model  distribution 
Py-  For  this  purpose  we  adopt  the  KL  distance  function  [11],  which  can  be 
shown  to  satisfy  D{py,py)  =  —Hy  —  (logpy)y ;  the  second  term  on  the  r.h.s. 
is  evaluated  by  averaging  logpy  (Y)  using  the  observed  distribution  py.  Since 
Hy,  the  entropy  of  the  observed  signals,  is  independent  of  the  mixing  model 
parameters,  minimizing  D  is  equivalent  to  maximizing  the  log-likelihood  of 
the  data,  (logpy)y,  with  respect  to  gm-  It  follows  that 

+ 

after  dropping  the  average  sign  and  terms  independent  of  gm- 

Before  deriving  the  learning  rules  we  make  a  few  simplifications  in  the 
model  (3)  by  omitting  the  frequency  dependence  of  Pi^k  and  using  the  same 
parametrized  functional  form  for  all  sources.  In  addition,  we  restrict  Pi^kiXi^k) 
to  depend  only  on  the  squared  amplitude  |  Xi^k  P-  These  simplifications  are 
made  for  convenience,  but  a  more  complicated  parametrization  can  be  used  in 
situations  where  the  actual  source  distribution  depends  non-trivially  on  the 
frequency  or  phase.  Note  that  our  model  sources  are  white,  in  anticipation 
of  the  whitening  effect  discussed  below.  Hence  Pi^k(Xi^k)  =  P{\  Xi^k  P;6)j 
where  is  a  vector  of  parameters  for  source  i.  For  instance,  P  may  be  a 
mixture  of  Gaussian  distributions  whose  means,  variances  and  weights  are 
contained  in 

The  frequency-domain  DCA  learning  rules  for  the  separating  filters  gm 
and  the  source  distribution  parameters  are  now  obtained  using  a  stochastic 
gradient  descent  minimization  of  the  KL  distance  (4): 

SGk  =  e[l-$(X*)Xt]G4, 

1  Ft 

!';«») .  (5) 

where  6gm  are  obtained  from  6Gk  by  inverse  DFT  for  0  <  m  <  M  —  1  and 
are  set  to  zero  for  m  >  M.  The  vector  $(Xfc)  above  is  related  to  the  model 
source  distribution  by 

=  -Xi,k^logP{a  =1  Xi,,  .  (6) 

The  learning  rate  is  set  by  e. 


1  / 

D{Py,Py)  =  -jz  Y  (logdetGfcG^ 

Jfe=l  \ 
We point out that to derive the rule for $g_m$ we used
$\delta g_m = -\epsilon\,\partial D/\partial g_m$, but since the resulting
$\delta G_k$ required matrix inversion for each $\omega_k$ at each iteration,
we multiplied it by the positive-definite matrix $G_k^\dagger G_k$ to get the
less expensive rule (5). It can be shown [9] that this rule indeed decreases
$D$ at each iteration in the small-$\epsilon$ limit. Furthermore, it can be
shown to satisfy the property of equivariance (see [6,7] for equivariant
algorithms for instantaneous mixing), which guarantees uniform performance
across all invertible mixing processes.
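
As an illustration of the equivariant increment in (5), the following is a minimal sketch for a single frequency bin, written in plain Python with matrices stored as lists of rows. The helper names (`matmul`, `phi`, `delta_g`) are hypothetical, and the nonlinearity assumes a model density $P(a) \propto e^{-a}$, for which (6) gives $\Phi_i = X_i$; this choice is made only to keep the example short and is not the distribution used in the experiments.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows (entries may be complex)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def phi(x_k):
    """Assumed nonlinearity: for P(a) proportional to exp(-a), Phi_i = X_i."""
    return list(x_k)


def delta_g(g_k, y_k, eps):
    """Increment eps * [I - Phi(X_k) X_k^dagger] G_k for one frequency bin."""
    n = len(g_k)
    # Recovered sources at this frequency: X_k = G_k Y_k.
    x_k = [sum(g_k[i][j] * y_k[j] for j in range(n)) for i in range(n)]
    p = phi(x_k)
    # Bracket I - Phi(X_k) X_k^dagger (outer product with the conjugate).
    bracket = [[(1.0 if i == j else 0.0) - p[i] * x_k[j].conjugate()
                for j in range(n)] for i in range(n)]
    return [[eps * v for v in row] for row in matmul(bracket, g_k)]


if __name__ == "__main__":
    identity = [[1 + 0j, 0j], [0j, 1 + 0j]]
    print(delta_g(identity, [1 + 0j, 0j], 0.1))
```

Note that no matrix inverse appears anywhere in the update, which is the practical benefit of multiplying the ordinary gradient by $G_k^\dagger G_k$.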

We emphasize the importance of using a time block that is sufficiently
longer than our model filters. As is evident in the frequency-domain
formulation (r.h.s. of (2) and the $\delta G_k$ rule of (5)), we are
effectively solving $N$ individual mixing problems, one at each $\omega_k$,
and risk recovering the sources with a different ordering permutation at
different frequencies, possibly reducing the separation quality. The key point
here is that these $N$ problems (or $N$ increments $\delta G_k$) are not
independent, since the minimization of the distance function with respect to
the $M$ time-domain coefficients $g_m$ couples them and solves them
simultaneously. Consequently, to minimize the freedom of arbitrary
permutations by exploiting this coupling we must choose $M \ll N$. Note
that this difficulty does not reflect a limitation of any particular algorithm;
rather, it is inherent to the convolutive mixing problem.
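
The coupling between frequency bins comes from the step in which the increments are taken back to the time domain and only the first $M$ taps are kept. A minimal sketch of that step for a single scalar filter entry is given below; the function names `dft` and `truncate_to_taps` are hypothetical, and a plain $O(N^2)$ DFT is used for clarity rather than speed.

```python
import cmath


def dft(g, n):
    """N-point DFT of a tap sequence, zero-padded to length n."""
    padded = list(g) + [0.0] * (n - len(g))
    return [sum(padded[m] * cmath.exp(-2j * cmath.pi * k * m / n)
                for m in range(n)) for k in range(n)]


def truncate_to_taps(big_g, m_taps):
    """Inverse-DFT the N frequency samples, keep taps 0..M-1, zero the rest."""
    n = len(big_g)
    taps = [sum(big_g[k] * cmath.exp(2j * cmath.pi * k * m / n)
                for k in range(n)) / n for m in range(n)]
    return [taps[m] if m < m_taps else 0.0 for m in range(n)]


if __name__ == "__main__":
    # A two-tap filter survives truncation with M = 2 ...
    print(truncate_to_taps(dft([1.0, 0.5], 8), 2))
    # ... while a component living only at tap 4 is removed entirely.
    print(truncate_to_taps(dft([0.0, 0.0, 0.0, 0.0, 1.0], 8), 2))
```

Because every retained tap is a weighted sum over all $N$ frequency samples, an arbitrary per-frequency permutation generally cannot be represented with only $M \ll N$ taps, which is why the short-filter constraint discourages it.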

3 TIME-DOMAIN OPTIMIZATION

We now derive the learning rules for the separating filter matrix starting
from the time-domain description (2). For the model source distribution we
use the factorial form

$$p_x(\{x_m\}) = \prod_{i=1}^{L} \prod_{m=0}^{N-1} p_{i,m}(x_{i,m}) . \qquad (7)$$

Using (7) together with (2), it is straightforward to derive the time-domain
model sensor distribution $\hat{p}_y$ and its KL distance $D$ to the observed
distribution $p_y$:

$$D(p_y, \hat{p}_y) = -\log\det g_0 - \frac{1}{N} \sum_{m=0}^{N-1} \sum_{i=1}^{L} \log p_{i,m}(x_{i,m}) . \qquad (8)$$

As in the frequency-domain case, we simplify the model (7) by omitting
the $t_m$-dependence (assuming stationary sources) and using the same
functional form for all sources, parametrized by the vector $\xi_i$. Hence
$p_{i,m}(x_{i,m}) = p(x_{i,m}; \xi_i)$. Stochastic gradient descent
minimization of the KL distance (8) yields the time-domain DCA learning rules:

$$\delta G_k = \epsilon\,(g_0^T)^{-1} - \epsilon\,\Phi_k(x)\,Y_k^\dagger ,$$

$$\delta \xi_i = -\epsilon\,\frac{\partial D}{\partial \xi_i} , \qquad (9)$$

where $\delta g_m$ is obtained from $\delta G_k$ by inverse DFT. The vector
$\Phi_k(x)$ above is the DFT over $m$ of the vector with components
$\psi(x_{i,m}; \xi_i)$. The latter is related to the model source
distribution by the logarithmic derivative

$$\psi(x_{i,m}; \xi_i) = -\frac{\partial}{\partial a}\log p(a = x_{i,m};\, \xi_i) . \qquad (10)$$

The $\delta G_k$ rule (9) is related to the rule derived in [12] (see also
[13]) using information maximization considerations. It is not equivariant and
is therefore not as efficient as (5).


4  SEPARATION  OF  SPEECH  SIGNALS 

We illustrate the performance of DCA by applying it to a convolutive
mixture of speech signals. We mixed two 10-second-long signals, obtained from
a commercial CD at the original sampling rate of 44.1 kHz and down-sampled
to $f_s = 4.41$ kHz, by filters $h_{ij,n}$ whose impulse response is displayed
in Figure 1. We then used the learning rules (5) and (9) to find the separating
filters $g_{ij,n}$. The signals were processed in 512-point non-overlapping
blocks, incrementing the separating filters after each block with
$\epsilon = 0.01$.

We used an exponential form for the model source distribution, which
approximates the distribution of the speech signals, as well as a large
class of natural signals [15]. We also experimented with other distributions,
such as the sigmoid-derivative form $p \propto e^{-x}/(1 + e^{-x})^2$ used in
[3]. Note that for the frequency-domain rule it is necessary to scale the
variance by $N$ or, alternatively, to modify the DFT definition by $\sqrt{N}$.
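
To make the role of the model density concrete, the sketch below evaluates the logarithmic derivative (10) for two candidate densities. Both closed forms are assumptions for illustration: a unit-scale exponential density $p \propto e^{-|x|}$, which gives $\psi(x) = \mathrm{sign}(x)$, and the sigmoid-derivative density $p \propto e^{-x}/(1+e^{-x})^2$, which gives $\psi(x) = \tanh(x/2)$. The function names are hypothetical, and `psi_numeric` is included only as a finite-difference sanity check.

```python
import math


def psi_exponential(x):
    """psi for p(x) proportional to exp(-|x|) (assumed unit scale): sign(x)."""
    return (x > 0) - (x < 0)


def psi_sigmoid_derivative(x):
    """psi for p(x) proportional to exp(-x) / (1 + exp(-x))**2: tanh(x / 2)."""
    return math.tanh(x / 2.0)


def psi_numeric(log_p, x, h=1e-6):
    """Central-difference estimate of -d/dx log p(x), used as a sanity check."""
    return -(log_p(x + h) - log_p(x - h)) / (2.0 * h)


if __name__ == "__main__":
    def log_p(x):
        # Unnormalized log of the sigmoid-derivative density.
        return -x - 2.0 * math.log(1.0 + math.exp(-x))

    print(psi_sigmoid_derivative(0.7), psi_numeric(log_p, 0.7))
```

Both nonlinearities are bounded, so large-amplitude samples contribute a limited amount to each increment, in contrast to the linear $\psi$ that a Gaussian model would give.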

To demonstrate that separation has actually been accomplished, we present
the convolution $(g \star h)_{ij,n}$ of the separating with the mixing filters
in Figure 2. In the case of time-domain separation (solid line) the
non-diagonal filters $(g \star h)_{12,n}$ and $(g \star h)_{21,n}$ are strongly
attenuated compared to the diagonal ones, whereas in the case of
frequency-domain separation (dashed line) the opposite is true. Thus
separation has been achieved by both learning rules, followed by an order
permutation for the latter.

Figure 2: Convolution of the separating filters $g_{ij,n}$, obtained from
time-domain (solid line) and frequency-domain (dashed line) optimization, with
the mixing filters $h_{ij,n}$.


Note that the separation has modified the source power spectra, since
for the diagonal solid-line filters $(g \star h)_{ii,n} \neq 0$ for $n > 0$,
and similarly for the non-diagonal dashed-line filters. This whitening effect
results from the fact that a general convolutive mixing situation is defined
only to within an arbitrary permutation and filtering of the sources, just as
an instantaneous mixing situation is defined to within source permutation and
scaling. This ambiguity is evident in the frequency-domain form of (2):
indeed, the distinction between the power spectrum of the sensor signals
($|Y_{j,k}|^2$) and the separating transformation $G_{ij,k}$ cannot be made
without prior information about the sources or the mixing situation.


5  CONCLUSION 

In  this  paper  we  presented  an  optimization  formulation  of  the  problem  of 
convolutive  mixing,  which  allowed  us  to  derive  the  DCA  rules  for  learning 
the  separating  filter  matrix  from  the  observed  mixtures  in  an  unsupervised 
fashion.  This  formulation  is  advantageous  since  it  results  in  a  stable  algorithm 
and  facilitates  a  systematic  derivation  of  time-  and  frequency-domain  learning 
rules. 

The DCA learning rules are non-linear in the signals and require
processing in $N$-point time blocks. Consequently, this algorithm exploits
high-order temporal and inter-sensor statistics to achieve separation. The
rules presented here whiten the recovered sources; in [9] we show a way to
avoid this by learning the source spectra as additional model parameters.
Consequently, we must impose appropriate constraints on the mixing model to
avoid the ambiguity mentioned above. To do that we derive learning rules for
the mixing transformation. Those rules, however, do not satisfy the equivariant
property, in contrast with the frequency-domain rules for the separating
transformation derived in the present paper; nevertheless, in addition to
facilitating the imposition of constraints, learning the mixing filters has the
advantage of reducing model complexity since they are usually shorter than the
separating filters.

Algorithms that solve the problem of blind source separation address, in
fact, the more general need for an efficient tool for statistical analysis of
temporal multi-variable data sets. We are currently using DCA to perform source
analysis of auditory evoked potentials in magnetoencephalogram (MEG)
multichannel recordings [16], where this technique is capable of separating the
contributions to the sensor signals from different sources of neural activity
that respond simultaneously to the stimulus.

Abstract

The  removal  of  noise  from  speech  signals  has  applications  ranging 
from  speech  enhancement  for  cellular  communications,  to  front  ends 
for  speech  recognition  systems.  A  nonlinear  time-domain  method 
called  dual  extended  Kalman  filtering  (DEKF)  is  presented  for  re¬ 
moving  nonstationary  and  colored  noise  from  speech.  We  further 
generalize  the  algorithm  to  perform  the  blind  separation  of  two  speech 
signals  from  a  single  recording. 

INTRODUCTION 

Traditional  approaches  to  noise  removal  in  speech  involve  spectral  techniques,  which 
frequently  result  in  audible  distortion  of  the  signal.  Recent  time-domain  nonlinear 
filtering  methods  utilize  data  sets  where  the  clean  speech  is  available  as  a  target  sig¬ 
nal  to  train  a  neural  network.  Such  methods  are  often  effective  within  the  training 
set,  but  tend  to  generalize  poorly  for  actual  sources  with  varying  signal  and  noise 
levels.  Furthermore,  the  network  models  in  these  methods  do  not  fully  take  into  ac¬ 
count  the  nonstationary  nature  of  speech.  In  the  approach  presented  here,  we  assume 
the  availability  of  only  the  noisy  signal.  Effectively,  a  sequence  of  neural  networks 
is  trained  on  the  specific  noisy  speech  signal  of  interest,  resulting  in  a  nonstationary 
model  which  can  be  used  to  remove  noise  from  the  given  signal. 

A noisy speech signal $y(k)$ can be accurately modeled as a nonlinear
autoregression with both process and additive observation noise:

$$x(k) = f(x(k-1), \ldots, x(k-M), w) + v(k) \qquad (1)$$

$$y(k) = x(k) + n(k) \qquad (2)$$

where $x(k)$ corresponds to the true underlying speech signal driven by process
noise $v(k)$, and $f(\cdot)$ is a nonlinear function of past values of $x(k)$
parameterized by $w$. The speech is only assumed to be stationary over short
segments, with each segment having a different model. The available observation
is $y(k)$, which contains additive noise $n(k)$. The optimal estimator given
the noisy observations $\mathbf{y}(k) = \{y(k), y(k-1), \ldots, y(0)\}$ is
$E[x(k)\,|\,\mathbf{y}(k)]$. The most direct way to estimate this would be to
train on a set of clean data in which the true $x(k)$ may be used as the target
to a neural network. Our assumption, however, is that the clean speech is never
available; the goal is to estimate $x(k)$ itself from the noisy measurements
$y(k)$ alone.

In order to solve this problem, we assume that $f(\cdot,\cdot)$ is in the class
of feedforward neural network models, and compute the dual estimation of both
states $x$ and weights $w$ based on a Kalman filtering approach. In this paper
we provide a basic description of the algorithm, followed by a discussion of
experimental results.

DUAL  EXTENDED  KALMAN  FILTERING 

By posing the dual estimation problem in a state-space framework, we can use
Kalman filtering methods to perform the estimation in an efficient, recursive
manner. At each time point, the Kalman filter provides an optimal estimate by
combining a prior prediction with a new observation. Connor et al. [4] proposed
using an extended Kalman filter with a neural network to perform state
estimation alone. Puskorius and Feldkamp [13] and others have posed the weight
estimation in a state-space framework to allow for efficient Kalman training of
a neural network. In prior work, we extended these ideas to include the dual
Kalman estimation of both states and weights for efficient maximum-likelihood
optimization (in the context of robust nonlinear prediction, estimation, and
smoothing) [15]. The work presented here develops these ideas in the context of
speech processing.

A state-space formulation of (1) and (2) is as follows:

$$\mathbf{x}(k) = F[\mathbf{x}(k-1)] + B\,v(k), \qquad (3)$$

$$y(k) = C\,\mathbf{x}(k) + n(k), \qquad (4)$$

where

$$\mathbf{x}(k) = \begin{bmatrix} x(k) \\ x(k-1) \\ \vdots \\ x(k-M+1) \end{bmatrix}, \quad
F[\mathbf{x}(k)] = \begin{bmatrix} f(x(k), \ldots, x(k-M+1), w) \\ x(k) \\ \vdots \\ x(k-M+2) \end{bmatrix},$$

$$C = [\,1 \ \ 0 \ \ \cdots \ \ 0\,], \quad B = C^T . \qquad (5)$$

If the model is linear, then $f(\mathbf{x}(k))$ takes the form
$w^T \mathbf{x}(k)$, and $F[\mathbf{x}(k)]$ can be written as
$A\mathbf{x}(k)$, where $A$ is in controllable canonical form. We initially
assume the noise terms $v(k)$ and $n(k)$ are white with known variances
$\sigma_v^2$ and $\sigma_n^2$, respectively. Methods for estimating the noise
variances directly from the noisy data are described later in this paper.
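
For the linear case, the controllable canonical form can be written out explicitly. The following minimal sketch builds $A$ from a weight vector and advances the lagged state vector by one step; the function names `companion` and `step` are hypothetical and the weights in the example are arbitrary.

```python
def companion(w):
    """State-transition matrix A for a linear AR model with weights w (length M)."""
    m = len(w)
    a = [list(w)]                      # first row applies the AR weights
    for i in range(m - 1):             # remaining rows shift the lags down by one
        a.append([1.0 if j == i else 0.0 for j in range(m)])
    return a


def step(a, x, v=0.0):
    """One step x(k) = A x(k-1) + B v(k), with B = [1, 0, ..., 0]^T."""
    nxt = [sum(a[i][j] * x[j] for j in range(len(x))) for i in range(len(x))]
    nxt[0] += v                        # process noise enters the newest sample only
    return nxt


if __name__ == "__main__":
    a = companion([0.5, -0.2, 0.1])
    print(step(a, [1.0, 2.0, 3.0]))
```

The first row produces the new sample from the $M$ previous ones, and the shifted identity below it simply ages the remaining entries, which is the structure $F[\cdot]$ has in (5) with the neural network in place of the first row.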

Extended  Kalman  Filter  -  State  Estimation 

For a linear model with known parameters, the Kalman filter (KF) algorithm can
be readily used to estimate the states [9]. At each time step, the filter
computes the linear least squares estimate $\hat{\mathbf{x}}(k)$ and prediction
$\hat{\mathbf{x}}^-(k)$, as well as their error covariances, $P_{\hat{x}}(k)$
and $P_{\hat{x}}^-(k)$. In the linear case with Gaussian statistics, the
estimates are the minimum mean square estimates. With no prior information on
$\mathbf{x}$, they reduce to the maximum likelihood estimates.

When the model is nonlinear, the KF cannot be applied directly, but requires a
linearization of the nonlinear model at each time step. The resulting algorithm
is called the extended Kalman filter (EKF), and effectively approximates the
nonlinear function with a time-varying linear one. The EKF algorithm is as
follows:


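$$\hat{\mathbf{x}}^-(k) = F[\hat{\mathbf{x}}(k-1)] \qquad (6)$$

$$P_{\hat{x}}^-(k) = A\,P_{\hat{x}}(k-1)\,A^T + B\,\sigma_v^2\,B^T , \quad \text{where } A = \frac{\partial F[\mathbf{x}]}{\partial \mathbf{x}}\Big|_{\hat{\mathbf{x}}(k-1)} \qquad (7)$$

$$K(k) = P_{\hat{x}}^-(k)\,C^T \left( C\,P_{\hat{x}}^-(k)\,C^T + \sigma_n^2 \right)^{-1} \qquad (8)$$

$$P_{\hat{x}}(k) = (I - K(k)\,C)\,P_{\hat{x}}^-(k) \qquad (9)$$

$$\hat{\mathbf{x}}(k) = \hat{\mathbf{x}}^-(k) + K(k)\,(y(k) - C\,\hat{\mathbf{x}}^-(k)). \qquad (10)$$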
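
The recursion (6)-(10) is easiest to follow in the scalar case $M = 1$, where $C = B = 1$ and every matrix reduces to a number. The sketch below assumes a hypothetical one-weight model $f(x, w) = \tanh(w x)$ purely for illustration; the function name `ekf_step` and all numerical values are assumptions, not taken from the experiments.

```python
import math


def ekf_step(x_prev, p_prev, y, w, var_v, var_n):
    """One scalar EKF update for x(k) = tanh(w x(k-1)) + v(k), y(k) = x(k) + n(k)."""
    x_pred = math.tanh(w * x_prev)           # (6) prediction through the model
    a = w * (1.0 - x_pred ** 2)              # linearization dF/dx at x_prev
    p_pred = a * p_prev * a + var_v          # (7) predicted error variance
    gain = p_pred / (p_pred + var_n)         # (8) Kalman gain with C = 1
    p_new = (1.0 - gain) * p_pred            # (9) posterior error variance
    x_new = x_pred + gain * (y - x_pred)     # (10) corrected state estimate
    return x_new, p_new, gain


if __name__ == "__main__":
    print(ekf_step(0.2, 1.0, 0.5, 1.5, 0.01, 0.1))
```

The gain always lies between 0 and 1 in this scalar setting: when the observation noise variance is small the estimate follows $y(k)$, and when it is large the estimate stays close to the model prediction.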

Dual  Extended  Kalman  Filter  -  Weight  Estimation 


Because  the  model  for  the  speech  is  not  known,  the  standard  EKF  algorithm  cannot 
be  applied  directly.  We  approach  this  problem  by  constructing  a  separate  state-space 
formulation  for  the  underlying  weights  as  follows: 


$$w(k) = w(k-1) \qquad (11)$$

$$y(k) = f(\mathbf{x}(k-1), w(k)) + v(k) + n(k) \qquad (12)$$

where the state transition is simply an identity matrix, and the neural network
$f(\mathbf{x}(k-1), w(k))$ plays the role of a time-varying nonlinear
observation on $w$. These state-space equations for the weights allow us to
estimate them with a second EKF.


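$$\hat{w}^-(k) = \hat{w}(k-1) \qquad (13)$$

$$P_w^-(k) = P_w(k-1) \qquad (14)$$

$$K_w(k) = P_w^-(k)\,H(k)^T \left( H(k)\,P_w^-(k)\,H(k)^T + \sigma_v^2 + \sigma_n^2 \right)^{-1} \qquad (15)$$

$$P_w(k) = (I - K_w(k)\,H(k))\,P_w^-(k) \qquad (16)$$

$$\hat{w}(k) = \hat{w}^-(k) + K_w(k)\,(y(k) - C\,F(\hat{\mathbf{x}}(k-1), \hat{w}^-(k))) \qquad (17)$$

$$\text{where } H(k) = C\,\frac{\partial F[\hat{\mathbf{x}}, w]}{\partial w}\Big|_{\hat{w}(k-1)} \qquad (18)$$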

The linearization in (18) can be computed as a dynamic derivative [16] to
account for the recurrent nature of the state-estimation filter, including the
dependence of the Kalman gain $K(k)$ on the weights. The calculation of these
derivatives is computationally expensive, and can be avoided by ignoring the
dependence of $\hat{\mathbf{x}}(k)$ on $w$.^1 This approximation was used to
produce the results in this paper. The use of the full derivatives is currently
being investigated by the authors.
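
A minimal sketch of the weight filter under the approximation just described (the state estimate is treated as a fixed input, so the linearization is the plain derivative of the model output with respect to the weight) is given below. It assumes a hypothetical scalar linear model $f(x, w) = w x$ and takes the innovation variance to include both $\sigma_v^2$ and $\sigma_n^2$; the function name and numerical values are illustrative only.

```python
def weight_ekf_step(w_prev, pw_prev, x_prev, y, var_v, var_n):
    """One scalar weight update for the assumed linear model f(x, w) = w * x."""
    w_pred = w_prev                                  # (13) identity state transition
    pw_pred = pw_prev                                # (14) no process noise on w
    h = x_prev                                       # (18) df/dw for f = w * x
    gain = pw_pred * h / (h * pw_pred * h + var_v + var_n)   # (15)
    pw_new = (1.0 - gain * h) * pw_pred              # (16)
    w_new = w_pred + gain * (y - w_pred * x_prev)    # (17) prediction-error correction
    return w_new, pw_new


if __name__ == "__main__":
    w, pw = 0.0, 1.0
    for _ in range(50):
        # Noise-free toy data generated with a true weight of 0.8.
        w, pw = weight_ekf_step(w, pw, 1.0, 0.8, 0.01, 0.01)
    print(w, pw)
```

With a static state equation for the weights, this recursion behaves like recursive least squares on the prediction error, which is why the weight covariance shrinks steadily as more samples are processed.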

^1 This is equivalent to a single step of backpropagation through time [16].

Figure 1: The Dual Extended Kalman Filter (DEKF). EKF1 and EKF2 represent the
filters for the states and the weights, respectively.

We  now  have  EKFs  for  estimating  both  the  states  x  and  the  weights  w,  resulting  in  a 
pair  of  dual  extended  Kalman  filters  (DEKF)  run  in  parallel  (see  Figure  1).  At  each 
time  step,  the  current  estimate  of  x  is  used  by  the  weight  filter,  and  the  current  esti¬ 
mate  of  w  is  used  by  the  state  filter.  For  finite  datasets,  the  algorithm  is  run  iteratively 
over  the  data  until  the  weights  converge. 

This approach to dual estimation is related to work done by Nelson [12] in the
linear case, and to Matthews' neural approach [11] to the recursive prediction
error algorithm [7].^2 In the speech literature, the method is most closely
related to Lim and Oppenheim's approach to fitting LPC models to degraded speech
[10]. It also relates to Ephraim's model-based approach [6], but uses nonlinear
estimation to fit the given data instead of using a fixed number of prespecified
linear models.

Nonstationary  White  Noise  Experiment 

The result of applying the DEKF to a speech signal corrupted with simulated
nonstationary bursting noise is shown in Figure 2. The method was applied to
successive 64 ms (512 point) windows of the signal, with a new window starting
every 8 ms (64 points).^3 Feedforward networks with 10 inputs, 4 hidden units,
and 1 output were used. Weights typically converged in less than 20 epochs. The
results in the figure were computed assuming both $\sigma_v^2$ and $\sigma_n^2$
were known. The average SNR is improved by 9.94 dB, with little resultant
distortion. We also ran the experiment when $\sigma_v^2$ and $\sigma_n^2$ were
estimated using only the noisy signal. This also produced impressive results,
with an SNR improvement of 8.50 dB. In comparison, available "state-of-the-art"
techniques of spectral subtraction [3] and adaptive RASTA processing [1] achieve
SNR improvements of only 0.65 and 1.26 dB, respectively.

^2 An alternative approach is to concatenate both $w$ and $x$ into a joint state
vector, and apply the EKF to the resulting nonlinear state equations (see [7]
for the linear case, [17] for application to recurrent neural networks). This
algorithm, however, has been known to have convergence problems.

^3 A normalized Hamming window was used to emphasize data in the center of the
window, and deemphasize data in the periphery. The standard EKF equations are
also modified to reflect this windowing in the weight estimation.

Clean Speech

Figure 2: Cleaning Noisy Speech With The DEKF. The speech data was approximately
33,000 points (4 seconds) long. Nonstationary white noise was generated
artificially and added to the speech to create the noisy signal $y$.

Colored  Noise 

For most real-world speech applications, we cannot assume the noise is white.
For colored noise, the state-space equations (3) and (4) need to be adjusted
before Kalman filtering techniques can be employed. Specifically, the
measurement noise process is given its own state-space equations,

$$\mathbf{n}(k) = A_n\,\mathbf{n}(k-1) + B_n\,v_n(k) \qquad (19)$$

$$n(k) = C_n\,\mathbf{n}(k), \qquad (20)$$

where $\mathbf{n}(k)$ is a vector of lagged values of $n(k)$, $v_n(k)$ is white
noise, $A_n$ is a simple state transition matrix in controllable canonical form,
and $B_n$ and $C_n$ are of the same form as $B$ and $C$ given in (3) and (4).
Note that this is equivalent to an autoregressive model of the colored noise,
which may be fit from a small section of signal where the speech is not present.

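With this formulation for the colored noise, it is straightforward to augment
both the state $\mathbf{x}(k)$ and the weight $w(k)$ with $\mathbf{n}(k)$, and
write down combined state equations. Specifically, (3) and (4) are replaced by:

$$\begin{bmatrix} \mathbf{x}(k) \\ \mathbf{n}(k) \end{bmatrix} =
\begin{bmatrix} F[\mathbf{x}(k-1)] \\ A_n\,\mathbf{n}(k-1) \end{bmatrix} +
\begin{bmatrix} B & 0 \\ 0 & B_n \end{bmatrix}
\begin{bmatrix} v(k) \\ v_n(k) \end{bmatrix}, \qquad (21)$$

$$y(k) = [\,C \ \ C_n\,] \begin{bmatrix} \mathbf{x}(k) \\ \mathbf{n}(k) \end{bmatrix}, \qquad (22)$$

and (11) and (12) are replaced by:

$$\begin{bmatrix} w(k) \\ \mathbf{n}(k) \end{bmatrix} =
\begin{bmatrix} I & 0 \\ 0 & A_n \end{bmatrix}
\begin{bmatrix} w(k-1) \\ \mathbf{n}(k-1) \end{bmatrix} +
\begin{bmatrix} 0 \\ B_n \end{bmatrix} v_n(k), \qquad (23)$$

$$y(k) = f(\mathbf{x}(k-1), w(k)) + C_n\,\mathbf{n}(k) + v(k). \qquad (24)$$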

The  noise  processes  in  these  state-equations  are  now  white,  and  the  DEKF  algorithm 
can  be  used  to  estimate  the  signal.  Note,  the  colored  noise  explicitly  affects  not  only 
the  state  estimation,  but  also  the  weight  estimation. 

Clean  Speech 


Figure  3:  Removing  Colored  Noise  With  The  DEKF.  The  speech  data  is  3,500  points  long. 
An  actual  recording  of  stationary  colored  noise  was  added  to  the  speech  to  create  the  noisy 
signal  y. 

An actual recording of cellular phone noise was added to a speech signal to
produce the data shown in Figure 3. The noise was modeled as an AR(10) process
to determine $A_n$ and $\sigma_{v_n}^2$ using a segment of the data (512 points)
where no speech is present. The results in the figure reflect assumed knowledge
of $\sigma_v^2$, and showed an average SNR improvement of 5.77 dB. When
$\sigma_v^2$ is estimated, the SNR improvement is 5.71 dB. In this case,
spectral subtraction and adaptive RASTA processing produced comparable SNR
improvements of 4.87 dB and 5.27 dB, respectively.

MONAURAL  BLIND  SIGNAL  SEPARATION 

If the additive noise is colored and highly nonstationary, the distinction
between what is signal and what is noise becomes somewhat arbitrary. For this
reason, we consider the observation $y(k) = x(k) + n(k)$ to represent the
addition of two signals, $y(k) = x_1(k) + x_2(k)$. We simply treat the noise
itself as an additional signal that must be estimated. This is a form of blind
signal separation, in which, for example, the signals result from the mixing of
speakers. The problem differs from recent methods in the literature [2] for
blind signal separation in which $M$ signals must be separated from $M$
observations by learning a fixed "inverse weighting" matrix. Instead, we are
interested in separating two or more signals from a single (monaural)
observation.

Previous work on monaural signal separation has primarily been based on harmonic
selection and pitch tracking in the frequency domain [5]. In contrast, we
estimate each signal by learning a set of short-term models which best separate
the signals. Extension of the DEKF framework is straightforward. Specifically,
we formulate state equations containing $\mathbf{x}_1(k)$ and $\mathbf{x}_2(k)$:

$$\begin{bmatrix} \mathbf{x}_1(k) \\ \mathbf{x}_2(k) \end{bmatrix} =
\begin{bmatrix} F_1[\mathbf{x}_1(k-1)] \\ F_2[\mathbf{x}_2(k-1)] \end{bmatrix} +
\begin{bmatrix} B_1 & 0 \\ 0 & B_2 \end{bmatrix}
\begin{bmatrix} v_1(k) \\ v_2(k) \end{bmatrix}, \qquad (25)$$

$$y(k) = [\,C_1 \ \ C_2\,] \begin{bmatrix} \mathbf{x}_1(k) \\ \mathbf{x}_2(k) \end{bmatrix}. \qquad (26)$$

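These equations are analogous to the colored noise formulation (21) and (22),
where $\mathbf{n}$ is replaced by $\mathbf{x}_2$ and has corresponding nonlinear
model $F_2$ with parameters $w_2$. To estimate the weight parameters $w_1$ for
the neural network $F_1$ associated with $\mathbf{x}_1$ we form the state
equations

$$\begin{bmatrix} w_1(k) \\ \mathbf{x}_2(k) \end{bmatrix} =
\begin{bmatrix} w_1(k-1) \\ F_2[\mathbf{x}_2(k-1)] \end{bmatrix} +
\begin{bmatrix} 0 \\ B_2 \end{bmatrix} v_2(k), \qquad (27)$$

$$y(k) = f_1(\mathbf{x}_1(k-1), w_1(k)) + C_2\,\mathbf{x}_2(k) + v_1(k), \qquad (28)$$

where the state $\mathbf{x}_2$ is included so that the associated noise process
is white (compare to (23) and (24)). Finally, for estimating the second set of
weight parameters $w_2$ for $F_2$ associated with $\mathbf{x}_2$, we add a third
set of state equations

$$\begin{bmatrix} w_2(k) \\ \mathbf{x}_1(k) \end{bmatrix} =
\begin{bmatrix} w_2(k-1) \\ F_1[\mathbf{x}_1(k-1)] \end{bmatrix} +
\begin{bmatrix} 0 \\ B_1 \end{bmatrix} v_1(k), \qquad (29)$$

$$y(k) = f_2(\mathbf{x}_2(k-1), w_2(k)) + C_1\,\mathbf{x}_1(k) + v_2(k). \qquad (30)$$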

It is straightforward to show that given a known (linear) model for each signal,
both $x_1(k)$ and $x_2(k)$ are observable from the additive observation $y(k)$.
However, showing that model parameters may be jointly learned from the
observations alone is a much more difficult problem. Nevertheless, some simple
preliminary experiments have been performed which indicate the potential of this
approach. Figure 4 illustrates the blind separation of two segments of speech
(male /s/ and female /el/) which have been added together and then separated in
this manner. While encouraging, the approach should be viewed as only a starting
point to a model-based framework for approaching blind signal separation
problems.


ESTIMATING  NOISE  VARIANCES 

In the implementation of the DEKF, it is assumed that the variances of $v(k)$
and $n(k)$ in (3) and (4) are known quantities. In practical applications,
however, the noise variances must be estimated from the noisy data. We have
investigated several approaches for doing this in the speech processing domain.

Additive Noise Statistics: Assuming stationarity of the additive noise, the
noise variance $\sigma_n^2$ (or its full autocorrelation) may be estimated from
segments of the data $y(k)$ that do not contain speech. For slow variations in
the noise statistics, Hirsch [8] has proposed an approach based on histograms of
spectral magnitudes which does not require explicit segmentation of the data
into speech and non-speech segments.

Segment of Combined Speech (Monaural Recording)

Figure 4: Blind Separation. Top figure shows the combined signal supplied to the
algorithm. In the bottom two figures, the original speech is shown with a dashed
line, and the estimated signal with a solid line. 100 data points are shown.

For rapidly changing noise (e.g., background chatter, wind noise, and artifacts
introduced by automatic gain control) we are interested in short-term estimates
of the noise statistics for each window of noisy speech data. An approach we
have developed for nonstationary white noise sources re-estimates $\sigma_n^2$
as follows. First, we note that the optimal weights for the linear estimator

$$\hat{x}(k) = \sum_{i=0}^{M} w_i\,y(k-i) = w^T \mathbf{y}(k) \qquad (31)$$

may be expressed as

$$w = R_{yy}^{-1} \left( R_{yy} - \sigma_n^2 I \right) e_1 , \qquad (32)$$

where $R_{yy}$ is the sample autocorrelation of the noisy speech, and
$e_1 = [1 \ 0 \ \cdots \ 0]^T$. Next, we choose $\sigma_n^2$ in (32) such that
$w$ leads to a minimum variance estimator. We have shown that this provides an
upper bound on $\sigma_n^2$. Starting at this upper bound, $\sigma_n^2$ is
iteratively decreased until $w_0 > w_i \ \forall i \neq 0$, which forces the
current observation to have the greatest influence on the estimator output
relative to other observations. A new $\sigma_n^2$ is then re-estimated for each
short-term window. This technique was used for the results given with the
nonstationary white noise experiment.
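
The iterative decrease can be sketched for the smallest non-trivial case of two taps. The example below assumes that (32) is the usual Wiener solution $w = R_{yy}^{-1}(R_{yy} - \sigma_n^2 I)e_1$ with a symmetric $2 \times 2$ autocorrelation matrix, and it starts from a crude hypothetical bound (the total signal power $r_0$) rather than the minimum-variance bound described in the text; the function names, the shrink factor and the numbers are all illustrative assumptions.

```python
def wiener_weights(r0, r1, var_n):
    """Two-tap weights w = Ryy^{-1} (Ryy - var_n I) e1 for Ryy = [[r0, r1], [r1, r0]]."""
    det = r0 * r0 - r1 * r1
    # Ryy^{-1} e1 = [r0, -r1] / det, so w = e1 - var_n * Ryy^{-1} e1.
    return [1.0 - var_n * r0 / det, var_n * r1 / det]


def shrink_noise_variance(r0, r1, start, factor=0.95):
    """Decrease var_n from a starting bound until w0 exceeds every other weight."""
    var_n = start
    w = wiener_weights(r0, r1, var_n)
    while w[0] <= w[1]:
        var_n *= factor
        w = wiener_weights(r0, r1, var_n)
    return var_n, w


if __name__ == "__main__":
    print(shrink_noise_variance(2.0, 1.2, 2.0))
```

In this two-tap case the stopping condition $w_0 > w_1$ is equivalent to $\sigma_n^2 < r_0 - r_1$, so the loop ends at the first value on the shrink schedule that falls below that threshold.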

Process Noise Variance: To estimate $\sigma_v^2$ (assuming an LPC model for the
signal), Lim and Oppenheim [10] used an expression for the inverse Fourier
transform of the signal power (which is a function of $\sigma_v^2$). We have
developed an alternative approach by noting that the process noise variance
$\sigma_v^2$ can be estimated as the mean squared error of a linear AR predictor
on the clean data $x(k)$.^4 Specifically,

$$\sigma_v^2 = \sigma_x^2 - p_{xx}^T R_{xx}^{-1} p_{xx} ,$$

where $p_{xx}$ is the cross-correlation between the lagged input vector
$\mathbf{x}(k-1)$ and the current $x(k)$, and $R_{xx}$ is the autocorrelation of
the inputs. In our setup, only the noisy signal $y(k)$ with prediction residual
$\sigma_x^2 + \sigma_n^2 - p_{yy}^T R_{yy}^{-1} p_{yy}$ is available. We can
approximate $\sigma_x^2 - p_{xx}^T R_{xx}^{-1} p_{xx}$ using:

$$\sigma_x^2 \approx \sigma_y^2 - \sigma_n^2 , \quad p_{xx} \approx p_{yy} - p_{nn} , \quad R_{xx} \approx R_{yy} - R_{nn} . \qquad (33)$$

This results in the following estimate:

$$\hat{\sigma}_v^2 = (\sigma_y^2 - \sigma_n^2) - (p_{yy} - p_{nn})^T (R_{yy} - R_{nn})^{-1} (p_{yy} - p_{nn}) .$$

Note that when $n(k)$ is white, the terms in (33) simplify because
$p_{nn} = \sigma_n^2 e_1$ and $R_{nn} = \sigma_n^2 I$, where the additive noise
variance is estimated as above.
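
The identity behind this estimate, namely that the mean squared error of the optimal linear predictor on clean autoregressive data equals the process noise variance, can be checked on a first-order example. The sketch below uses the theoretical stationary moments of an AR(1) process with hypothetical parameters; it illustrates only the clean-data identity, not the noisy-data approximation (33).

```python
def ar1_moments(a, var_v):
    """Stationary moments of x(k) = a x(k-1) + v(k): variance and lag-1 terms."""
    var_x = var_v / (1.0 - a * a)
    p_xx = a * var_x          # cross-correlation of x(k-1) with x(k)
    r_xx = var_x              # autocorrelation of the single lagged input
    return var_x, p_xx, r_xx


def prediction_mse(var_x, p_xx, r_xx):
    """Mean squared error of the optimal one-tap linear predictor."""
    return var_x - p_xx * p_xx / r_xx


if __name__ == "__main__":
    var_x, p_xx, r_xx = ar1_moments(0.9, 0.25)
    print(prediction_mse(var_x, p_xx, r_xx))
```

For a signal that is not exactly autoregressive the same expression gives the residual of the best linear fit, which is why the estimate is exact only under the linear-AR assumption.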

While these "ad-hoc" methods were used in the experiments reported in this
paper, estimating the noise variances remains a critical area for future work.
Our current direction is to treat $\sigma_v^2$ and $\sigma_n^2$ as additional
parameters which may be optimized within the Kalman and maximum-likelihood
framework.

CONCLUSION  AND  FUTURE  WORK 

We  have  presented  a  DEKF  algorithm  with  preliminary  results  on  its  application  to 
speech  enhancement  in  the  presence  of  both  nonstationary  and  colored  noise.  The 
approach  performs  significantly  better  than  the  current  state-of-the-art  on  the  reduc¬ 
tion  of  nonstationary  noise,  and  performs  well  on  colored  noise  problems  as  well. 
In  addition,  the  application  of  the  approach  to  monaural  blind  signal  separation  was 
considered  as  a  special  case  of  the  nonstationary  colored  noise  problem. 

Future  work  will  include  additional  approaches  to  variance  estimation,  as  well  as  the 
coupling  of  error  statistics,  windowing  aspects,  recurrent  training  implications,  and 
forward-backward  methods  for  smoothing.  An  additional  aspect  that  is  currently  un¬ 
der  consideration  is  the  minimization  of  both  prediction  and  estimation  errors  by  the 
weight  filter.  While  the  current  implementation  minimizes  only  prediction  error  of 
the  model,  the  full  errors  in  variables  cost  function  [14,  15]  can  be  minimized  by 
a  two-observation  form  of  the  weight  filter.  This  refinement  will  be  discussed  in  a 
future paper.

^4 This is exact, assuming the signal is generated by a linear autoregression.

ABSTRACT

It has been shown that a particular case of the Bayesian Ying-Yang learning
system and theory reduces to a very general ICA framework. This framework not
only includes the existing information-theoretic ICA approaches as particular
examples, but also improves their performance, extends them to handle cases in
which the sensors are affected by noise and outliers and cases in which the
number of sensors is larger than the number of sources, and is able to detect
the correct number of sources. Algorithms are developed for implementing this
ICA framework both in its general form and in its simplified versions for two
important special cases, supported by some theoretical results and experimental
demonstration.

1.  Introduction 

Recently, Independent Component Analysis (ICA) has received increasing
attention with many applications [1]. Here, we consider a widely used
formulation of the ICA problem. There are $n$ channels of unknown
source signals $s = [s_1, \cdots, s_n]^T$, which are mutually independent with $Es = 0$.
The observations from $n$ sensors are given as $x = A_o s$ with $A_o$ being an $n \times n$
unknown nonsingular mixing matrix. The objective is to find a so-called
de-mixing matrix $W$ such that

$$y = Wx = WA_o s = Vs, \quad V = WA_o = \Pi D \tag{1}$$

with $\Pi$ being a permutation matrix and $D$ being a non-singular diagonal
matrix. That is, $y$ recovers $s$ up to unknown scales and a permutation of indices.
Theoretically, we can get such a $W$ as long as it makes $\{y_1, \cdots, y_n\}$ mutually
independent when at most one of the signals in $s$ is gaussian,
which motivated the name [1,2]. A recent popular stream for the problem is
called the information-theoretic approach. One idea is the minimization of the
mutual information (MMI) $J(W) = \int p(y) \ln [p(y)/\prod_{i=1}^{n} p_i(y_i)]\,dy$, because $J(W)$
reaches its minimum when $\{y_1, \cdots, y_n\}$ are mutually independent [1,2,3]. The other
idea is to maximize the entropy (INFORMAX) $J(W) = -\int p(z) \ln p(z)\,dz$ with
$z = [P_1(y_1), \cdots, P_n(y_n)]$ [4]. Both are shown to be special cases of a general
ICA framework and are equivalent when the accurate $p_i(y_i)$ is available [9].
However, the accurate $p_i(y_i)$ is difficult to get. In [2,3], it is approximated
by a pre-fixed truncated Edgeworth series or truncated Gram-Charlier series.
In [4], it is simply fixed at pre-given densities $p_i(y_i) = ds(y_i)/dy_i$ with $s(y_i)$ being
one of those sigmoid functions used in the literature of neural networks, and
understandably such an idea will only work for some special types of source
signal (e.g., super-gaussians). In [10], a new strategy is suggested, in which
$p_i(y_i)$ is no longer prefixed but learned via a flexible density function based
on a finite mixture of densities. It has been shown experimentally that this new
method works well not only on the examples where the methods in [1,3,4]
work, but also on the examples where the methods in [1,3,4] fail.
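
The recovery criterion of eq.(1) can be checked numerically once a candidate $W$ is available and, for test purposes, the mixing matrix is known. The following is a minimal sketch under that assumption; the helper names `matmul` and `is_scaled_permutation`, the tolerance and the example matrices are illustrative choices and not part of the original formulation.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def is_scaled_permutation(V, tol=1e-8):
    """True if V has exactly one entry with |v| > tol in every row and every column,
    i.e. V = Pi D for a permutation Pi and a nonsingular diagonal D."""
    n = len(V)
    if any(len(row) != n for row in V):
        return False
    row_counts = [sum(abs(v) > tol for v in row) for row in V]
    col_counts = [sum(abs(V[i][j]) > tol for i in range(n)) for j in range(n)]
    return all(c == 1 for c in row_counts) and all(c == 1 for c in col_counts)


# Hypothetical 2 x 2 mixing matrix and its exact inverse.
A_o = [[1.0, 0.5], [0.3, 1.0]]
det = A_o[0][0] * A_o[1][1] - A_o[0][1] * A_o[1][0]
A_inv = [[A_o[1][1] / det, -A_o[0][1] / det],
         [-A_o[1][0] / det, A_o[0][0] / det]]

# A valid de-mixing matrix: rows of the inverse, swapped and rescaled.
W_good = [[2.0 * v for v in A_inv[1]], [-3.0 * v for v in A_inv[0]]]
# An invalid one: the identity leaves the mixture untouched.
W_bad = [[1.0, 0.0], [0.0, 1.0]]

V_good = matmul(W_good, A_o)
V_bad = matmul(W_bad, A_o)
```

Here `V_good` equals a permutation times a diagonal matrix up to rounding, whereas `V_bad` is just the mixing matrix itself.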

*This project was supported by the HK RGC Earmarked Grants
CUHK250/94E, CUHK484/95E, and Ho Sin-Hang Education Endowment Fund HSH
95/02. Email lxu@cs.cuhk.hk
The present author strongly believes that the effort of trying to approximate
$p_i(y_i)$, though justified, is unnecessary. Therefore, in [9] the marginal
density $p_i(y_i)$ was replaced by a general integrable function $g_i(y_i)$,
while what kind of function it should be remained unclear. In [11], some
examples of $g_i(y_i)$ that are able to implement ICA under certain conditions
have been further given with theoretical analyses. Of course, not every
$g_i(y_i)$ can succeed in ICA; the workable $g_i(y_i)$ must come from a family $Q$.
However, what kinds of properties this family should bear still remains an
open question.

In recent years, a so-called Bayesian Ying-Yang (BYY) Learning system and
theory has been developed as a unified statistical learning approach, which
provides at least four types of new strengths [5,6,7,8]. First, it is able to unify
most of the existing major statistical learning models and theories. For example,
for unsupervised learning they include ML learning with the EM algorithm,
information geometry theory with the em algorithm, MDL autoencoder, Helmholtz
machine, independent component analysis (ICA) by INFORMAX or MMI, LMSER
learning, principal component analysis (PCA), various clustering methods and
self-organizing maps; and for supervised learning they include the conventional ML
learning (i.e., the BP algorithm) for feed-forward networks, ML learning for RBF nets,
mixture of experts and its alternatives. This unification provides not only deep
insights into these popular existing approaches but also further guidance
on obtaining their new variants or extensions via cross-fertilization. Second, some
special cases of the BYY Learning bring us several interesting new models for both
unsupervised and supervised learning, which deserve further investigation. Third,
the BYY Learning theory can function as a general theory not only for parameter
learning, but also for model selection (or more precisely, structural scale
selection), e.g., for selecting subspace dimension, number of clusters, number of
gaussians, number of experts, number of hidden units, etc. Finally, this same
theory can also function as a general theory for regularization and architecture
evaluation. Readers are referred to [5,6] for a rather systematic review of previous
results and several new advances.

In this paper, we show that a particular case of the BYY learning system
and theory reduces to a very general enhanced information-theoretic
ICA framework with several new powers. After briefly introducing the BYY
learning system and theory in Sec.2, we propose this ICA framework in
Sec.3. Then, in Sec.4 the gradient-based ICA algorithms are developed in
their general forms and in two detailed implementations. In Sec.5, the algorithms
for two important special cases are investigated in detail with three
theorems. Finally, in Sec.6 a variant ICA algorithm is proposed and
demonstrated to be more robust to outliers.

2.  BYY  Learning  System  and  Theory 

The perception tasks can be summarized into the problem of estimating the
joint distribution $p(x,y)$ of the observable pattern $x$ in the observable space
$X$ and its representation pattern $y$ in the representation space $Y$, as shown
in Fig.1. We call a passage $M_{y|x}$ for the flow $x \to y$ a Yang/(male)
passage since it performs the task of transferring a pattern/(a real body)
into a code/(a seed). We call a passage $M_{x|y}$ for the flow $y \to x$ a
Ying/(female) passage since it performs the task of generating a pattern/(a
real body) from a code/(a seed). $M_{y|x}$ and $M_{x|y}$ are complementary to each other
and together implement an entire circle $x \to y \to x$. Interestingly, under the
Bayesian framework we also have two representations $p(x,y) = p(y|x)p(x)$
and $p(x,y) = p(x|y)p(y)$. We use a Yang/(visible) model $M_x$ representing
$p(x)$ (i.e., modeling the space $X$), and we use a Ying/(invisible) model $M_y$
representing $p(y)$ (i.e., modeling the space $Y$). Moreover, $M_{y|x}$ is represented
by $p_{M_{y|x}}(y|x)$ and $M_{x|y}$ by $p_{M_{x|y}}(x|y)$. Together, we have a YANG machine
$M_1 = \{M_{y|x}, M_x\}$ to implement $p_{M_1}(x,y) = p_{M_{y|x}}(y|x)p_{M_x}(x)$ and a YING
machine $M_2 = \{M_{x|y}, M_y\}$ to implement $p_{M_2}(x,y) = p_{M_{x|y}}(x|y)p_{M_y}(y)$. A
pair of YING-YANG machines is called a YING-YANG pair or a YING-YANG
system. Such a formalization is consistent with a famous ancient Chinese
philosophy that every entity in the universe involves the interaction between
YING and YANG.

Figure 1 The joint spaces X, Y and the YING-YANG System

The task of specification of a Ying-Yang system is called learning in a broad
sense. First, we need to specify the variables $x$, $y$. Usually, $x$ is assumed to be
$x \in R^d$. But $y$ can be a real vector $y \in R^{k_r}$, an integer $y \in [1, 2, \cdots, k_r]$, or a binary
vector $y = [y_1, \ldots, y_{k_r}]$, $y_i \in [0,1]$, where $k_r$ represents the scale or complexity of
representation. Next, we specify four components $p_{M_x}(x)$, $p_{M_{y|x}}(y|x)$, $p_{M_{x|y}}(x|y)$ and
$p_{M_y}(y)$. Generally speaking, each of them is specified by three parts. The first
part is called Architecture Design, denoted by $S_a$, consisting of the general setting
on (a) its density function form $p(\cdot)$, (b) one or several types of basic structural
units and (c) an architecture for organizing a number of these structural units. The
second part is called model selection or, more precisely, Structural Scale Selection
for selecting scale parameters $k = [k_r, k_s]$, which consist of the above representation
scale $k_r$ and the scale or complexity $k_s$ of those more complicated basic structural
units themselves. The third part is called Parameter Learning or Estimation, also
often simply called learning in a narrow sense, for specifying a particular value of $\theta$,
a set of real variables on a certain domain. Together, for each $a \in \{x, x|y, y|x, y\}$,
each $M_a = \{S_a, \theta_a, k\}$ is specified only after all its three parts are specified. Some
examples are given in [7] to help understand this formalization better.

Our basic theory is that the specifications of the three levels,
namely Architecture Design, Structural Scale Selection and Parameter Learning,
should best enhance the so-called Ying-Yang Harmony or Marry, through
minimizing a so-called separation functional:

$$F_s(M_1, M_2) = F_s\big(p_{M_{y|x}}(y|x)p_{M_x}(x),\ p_{M_{x|y}}(x|y)p_{M_y}(y)\big) \ge 0,$$

$$F_s(M_1, M_2) = 0 \ \text{if and only if} \ p_{M_{y|x}}(y|x)p_{M_x}(x) = p_{M_{x|y}}(x|y)p_{M_y}(y), \tag{2}$$

which describes the harmonic degree of the Ying-Yang pair. Such a learning
system and theory is called the Bayesian Ying-Yang (BYY) Learning System
and Theory.

Three categories of separation functionals, namely Convex Divergence, Lp
Divergence, and De-correlation Index, have been suggested in [6]. The Convex
Divergence is defined as

$$F_s(p_1, p_2) = f(1) - \int_x p_1(x) f\Big(\frac{p_2(x)}{p_1(x)}\Big)dx, \quad f(u) \ \text{convex on} \ (0, +\infty), \tag{3}$$

from which we get $F_s(M_1, M_2)$ by substituting $p_1$ with $p_{M_{y|x}}(y|x)p_{M_x}(x)$ and $p_2$
with $p_{M_{x|y}}(x|y)p_{M_y}(y)$. Its three typical examples are given as follows:

(a) $f(u) = \ln u$, which leads us to the Kullback Divergence:

$$KL(M_1, M_2) = \int_{x,y} p_{M_{y|x}}(y|x)p_{M_x}(x) \ln \frac{p_{M_{y|x}}(y|x)p_{M_x}(x)}{p_{M_{x|y}}(x|y)p_{M_y}(y)}\,dx\,dy. \tag{4}$$

In this special case, the BYY learning is called Bayesian-Kullback YING-YANG
(BKYY) learning. It is the most useful case and has been extensively studied [5,6,7,8].

(b) $f(u) = -u^{\beta}$, $\beta > 1$, called the Minus Convex divergence.
(c) When $f(u) = u^{\beta}$, $0 < \beta < 1$, we call it the Positive Convex (PC) divergence.
Interestingly, when $\beta = 0.5$, it leads to a Root-Inner-Product (RIP) divergence
$F_s(p_1, p_2) = 1 - \int_x \sqrt{p_1(x)p_2(x)}\,dx$, which has a nice symmetric feature that the
Kullback divergence does not have.
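
The divergence family of eq.(3) is easy to evaluate for discrete distributions, where the integral becomes a sum. The sketch below is only an illustration under that discretization; the function names and the two example distributions are assumptions. Note that $\ln u$ and $u^{\beta}$ with $0 < \beta < 1$ are concave in today's usual terminology, which is what Jensen's inequality needs for $F_s \ge 0$.

```python
import math


def separation(p1, p2, f):
    """Discrete analogue of eq.(3): f(1) - sum_x p1(x) f(p2(x) / p1(x)).
    Both p1 and p2 are lists of strictly positive probabilities."""
    return f(1.0) - sum(a * f(b / a) for a, b in zip(p1, p2))


def kullback(p1, p2):
    """Case (a): f(u) = ln u gives the Kullback divergence."""
    return separation(p1, p2, math.log)


def positive_convex(p1, p2, beta=0.5):
    """Case (c): f(u) = u**beta, 0 < beta < 1; beta = 0.5 is the RIP divergence."""
    return separation(p1, p2, lambda u: u ** beta)


# Two arbitrary example distributions over three outcomes.
p = [0.7, 0.2, 0.1]
q = [0.2, 0.5, 0.3]

kl_pq, kl_qp = kullback(p, q), kullback(q, p)
rip_pq, rip_qp = positive_convex(p, q), positive_convex(q, p)
```

Evaluating both argument orders illustrates the remark above: the RIP value is the same either way, while the two Kullback values differ.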

From the BKYY learning eq.(4), we can obtain those four new strengths
mentioned in Sec.1 for unsupervised learning. Moreover, by replacing the Kullback
divergence eq.(4) with the other, non-Kullback separation functionals mentioned above,
we can obtain various alternatives to those Kullback divergence based learning
models, with some new interesting properties (e.g., robust learning). We call this
Bayesian Non-Kullback separation functionals Ying-Yang (BNKYY) learning. Also,
the BYY learning eq.(2) can be extended into a more general BYY learning system
such that those four new strengths mentioned in Sec.1 can also be obtained for
supervised learning [12,13], both for the cases in which the Kullback divergence eq.(4)
is used and for the cases in which the non-Kullback separation functionals are used.

In the rest of this paper, we will only concentrate on a particular case of
the BYY learning system and theory that leads to the ICA problem.


3.  General  BYY  ICA  Framework  and  BKYY  ICA  Scheme 


We modify the problem eq.(1) into a more general case in which sensors are
affected by noise and the number $m$ of sensors may be larger than the number
$n$ of sources. That is,

$$x = As + e_x, \quad s = y + e_y, \quad y = Wx, \quad m \ge n, \tag{5}$$

where the dimension of the vectors $y$, $s$ is $n$ and that of $x$ is $m$. We also have that
$Es = 0$, that the $m$ components of $e_x$ are mutually independent with $Ee_x = 0$ and
also independent of $s$, and thus also $Ex = 0$, $Ey = 0$ and $Ee_y = 0$.

Here we consider that the Ying space is for $s$ and the Yang space is for $x$. The
Yang density is $p_{M_x}(x) = p(x)$, which can be estimated from the sensor data.
The Ying density is not known exactly, but we can assume that $p_{M_s}(s) = g(s,\theta) =
\prod_{i=1}^{n} g_i(s_i,\theta_i)$ with $g_i(s_i,\theta_i)$ being of a specified simple density function form but
with $\theta_i$ variable in $\Theta$, which actually represents the family $Q$. The Yang passage $s =
y + e_y$, $y = Wx$ is described by a density $p_{M_{s|x}}(s|x)$ for the random variable $e_y$
under each given $x$ (thus $y$). The Ying passage $x = As + e_x$ is described by a density
$p_{M_{x|s}}(x|s)$ for the random variable $e_x$ under each given $s$. Furthermore, we can
assume that $e_x$ is from the gaussian $G(e_x, 0, \sigma^2 I)$, and thus $p_{M_{x|s}}(x|s) = G(x, As, \sigma^2 I)$.
Moreover, the fact $s = WAs + We_x + e_y$ suggests that a reasonable case is

$$WA = I, \ \text{and thus} \ A = W^{-} = W^T(WW^T)^{-1}, \quad e_y = -We_x. \tag{6}$$

Thus, $p_{M_{x|s}}(x|s) = G(x, W^{-}s, \sigma^2 I)$, and $e_y$ is a gaussian $G(e_y, 0, \sigma^2 WW^T)$ and
independent of $x$.

In summary, we have specified the architectural design as follows:

$$p_{M_x}(x) = p(x), \quad p_{M_{s|x}}(s|x) = G(s, Wx, \sigma^2 WW^T),$$

$$p_{M_{x|s}}(x|s) = G(x, W^{-}s, \sigma^2 I), \quad p_{M_s}(s) = g(s,\theta) = \prod_{i=1}^{n} g_i(s_i,\theta_i). \tag{7a}$$

Next, according to the Ying-Yang learning theory, we specify the remaining
unspecified items, namely, (a) the structural scale $k = k_r = n$, i.e., the number
of sources; (b) $\sigma^2$; (c) $W$; and (d) $\theta$, via eq.(2) to minimize $F_s(M_1, M_2)$:

$$\min_{\{W, \sigma^2, n, \theta\}} J(W, \sigma^2, n, \theta),$$

$$J(W, \sigma^2, n, \theta) = F_s\big(G(s, Wx, \sigma^2 WW^T)p(x),\ G(x, W^{-}s, \sigma^2 I)g(s,\theta)\big), \tag{7b}$$


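which is called the BYY ICA framework. Particularly, when the Kullback
divergence eq.(4) is used, it is simplified into

$$\min_{\{W, \sigma^2, n, \theta\}} J(W, \sigma^2, n, \theta), \quad J(W, \sigma^2, n, \theta) = \int_{x,s} G(s, Wx, \sigma^2 WW^T)p(x) \ln \frac{G(s, Wx, \sigma^2 WW^T)p(x)}{G(x, W^{-}s, \sigma^2 I)g(s,\theta)}\,dx\,ds, \tag{8}$$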


which is called the BKYY ICA scheme. Due to the space limit, the theoretical
justification of eq.(7b) and eq.(8) is given elsewhere [14]. Theoretical
justification for special cases will be given in Sec.5. From eq.(8), we see that
the minimization is irrelevant to the term $\int_x p(x) \ln p(x)\,dx$, which can thus be
omitted. As shown in detail in [14], we can equivalently rewrite eq.(8) into

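$$J_s(W, \sigma^2, n, \theta) = 0.5[(m - n)\ln\sigma^2 - \ln|WW^T| + m - n] + E J_G(x, W, \sigma^2, \theta),$$

$$J_G(x, W, \sigma^2, \theta) = -\int_s G(s, Wx, \sigma^2 WW^T) \ln g(s,\theta)\,ds, \quad g(s,\theta) = \prod_{i=1}^{n} g_i(s_i,\theta_i),$$

$$\{W_n^*, \theta_n^*\} = \arg\min_{\{W, \theta \in \Theta\}} J_s(W, \sigma^2, n, \theta), \quad \text{s.t.} \ \sigma^2 = \frac{1}{m-n}E\|x - W^{-}Wx\|^2,$$

$$n^* = \arg\min_n J(n), \quad J(n) = J_s(W_n^*, \sigma_n^{*2}, n, \theta_n^*). \tag{9}$$

With this BKYY ICA scheme, the best parameters $W_n^*, \theta_n^*, \sigma_n^{*2}$ are
estimated for each given $n$, and then a best number $n^*$ of sources is found
via $n^* = \arg\min_n J(n)$ with the corresponding $W_{n^*}^*, \theta_{n^*}^*, \sigma_{n^*}^{*2}$ as the final best
estimates.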

4.  The  Gradient-Based  Algorithms  for  The  BKYY  ICA 

To practically implement BKYY ICA, the key is how to get $W_n^*, \theta_n^*, \sigma_n^{*2}$
for each given $n$. Here, we adopt the gradient-based method. For simplicity,
we will omit the subscript $n$ wherever this causes no confusion.

First, we consider $J_G(x, W, \sigma^2, \theta)$ by Taylor expansion of $\ln g(s,\theta)$ around
$y = Wx$, and then use the expansion to evaluate the integral:

$$J_G(x, W, \sigma^2, \theta) = -\int_s G(s, Wx, \sigma^2 WW^T) \ln g(s,\theta)\,ds = -[\ln g(Wx,\theta) + \tfrac{1}{2}\sigma^2 tr(D_h WW^T) + o(\sigma^2)],$$

$$h(s,\theta) = \frac{\partial \ln g(s,\theta)}{\partial s}, \quad D_h = \frac{\partial h(s,\theta)}{\partial s^T}. \tag{10a}$$

For simplicity, we approximately use just the first term $-\ln g(Wx,\theta)$, which is valid
when $\sigma^2$ is small. Therefore, we have

$$\partial J_G(x, W, \sigma^2(W), \theta)/\partial W = -h(Wx,\theta)x^T. \tag{10b}$$

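Next, we have that

$$\sigma^2 = \frac{1}{m-n}E\|x - W^{-}Wx\|^2 = \frac{1}{m-n}\{tr[R_x] - tr[R_x W^T(WW^T)^{-1}W]\}, \quad R_x = E(xx^T),$$

$$\Delta_{W,\sigma^2} = \frac{\partial\, 0.5(m-n)\ln\sigma^2}{\partial W} = -\sigma^{-2}(WW^T)^{-1}[WR_x - WR_x W^T(WW^T)^{-1}W].$$

Together with $\partial \ln|WW^T|/\partial W = 2(WW^T)^{-1}W$, it follows from eq.(9) that

$$\frac{\partial J_s}{\partial W} = \Delta_{W,\sigma^2} - (WW^T)^{-1}W - E[h(Wx,\theta)x^T].$$

Therefore, we can get the following general forms of both the batch
and adaptive gradient algorithms for updating $W$, $\sigma^2$ with a stepsize $\eta$:

$$\Delta W = -\eta\{\Delta_{W,\sigma^2} - (WW^T)^{-1}W - E[h(Wx,\theta)x^T]\},$$

$$\sigma^{2\,new} = \frac{1}{m-n}E\|x - W^T(WW^T)^{-1}Wx\|^2. \tag{11}$$

$$\Delta W = -\eta\{\Delta_{W,\sigma^2} - (WW^T)^{-1}W - h(Wx,\theta)x^T\},$$

$$\sigma^{new\,2} = \sigma^{old\,2} + \frac{\eta}{m-n}\|x - W^T(WW^T)^{-1}Wx\|^2. \tag{12a}$$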

Similarly, we can get the general forms of both the batch and adaptive
algorithms for updating $\theta$ by

$$\Delta\theta = \eta E\frac{\partial \ln g(Wx,\theta)}{\partial\theta}, \quad \Delta\theta = \eta\frac{\partial \ln g(Wx,\theta)}{\partial\theta}. \tag{12b}$$

Thus, our gradient algorithm can be summarized as an iterative process
in which, in each iteration, both eq.(11) and eq.(12b) are used for updating $W$, $\sigma^2$
and $\theta$. As is well known, as long as the learning stepsize is controlled
appropriately, a gradient descent iterative process is guaranteed to converge to at least
a local minimum of $J_s(W, \sigma^2, n, \theta)$. So, our iterative process is guaranteed to
converge, and then we use the converged results as our estimates $W_n^*, \theta_n^*, \sigma_n^{*2}$.

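To get further detailed algorithms, we focus on two types of $g(s,\theta)$ family. The
first one is the finite mixture of parametric densities

$$g(s,\theta) = \sum_{j=1}^{q}\alpha_j p(s|\lambda_j), \quad \sum_{j=1}^{q}\alpha_j = 1, \ \alpha_j \ge 0,$$

$$\theta = \{\alpha_j, \lambda_j\}_{j=1}^{q}, \quad p(s|\lambda_j) = \prod_{i=1}^{n} p(s_i|\lambda_{j,i}), \tag{13a}$$

where $p(s_i|\lambda_{j,i})$ is some simple density function. In this case, we have

$$h(s,\theta) = \sum_{j=1}^{q} p(j|s)h(s,\lambda_j), \quad p(j|s) = \frac{\alpha_j p(s|\lambda_j)}{\sum_{j=1}^{q}\alpha_j p(s|\lambda_j)}, \quad h(s,\lambda_j) = \frac{\partial \ln p(s|\lambda_j)}{\partial s}. \tag{13b}$$

For a gaussian $p(s|\lambda_j) = G(s, 0, \Lambda_j)$ with $\Lambda_j$ being a diagonal matrix, we have

$$h(s,\lambda_j) = -\Lambda_j^{-1}s, \quad \frac{\partial \ln g(Wx,\theta)}{\partial\Lambda_j} = -\tfrac{1}{2}p(j|Wx)[\Lambda_j^{-1} - \Lambda_j^{-1}Wxx^TW^T\Lambda_j^{-1}]. \tag{14}$$

Thus, putting them into eq.(12), we get the batch gradient algorithm:

$$\Delta W = -\eta\Big\{\Delta_{W,\sigma^2} - (WW^T)^{-1}W + \sum_{j=1}^{q}\Lambda_j^{-1}WE[p(j|Wx)xx^T]\Big\}, \tag{15a}$$

$$p(j|Wx) = \frac{\alpha_j p(Wx|\lambda_j)}{\sum_{j=1}^{q}\alpha_j p(Wx|\lambda_j)}, \tag{15b}$$

$$\alpha_j = \int_x p_h(x)p(j|Wx)\,dx, \quad \Lambda_j = WE[p(j|Wx)xx^T]W^T. \tag{15c}$$

Actually, the algorithm for $\theta$ by eqs.(15b&c) is the so-called EM algorithm. Also,
with eq.(15b) still the same, we have the adaptive algorithm given by

$$\Delta W = -\eta\Big\{\Delta_{W,\sigma^2} - (WW^T)^{-1}W + \sum_{j=1}^{q}\Lambda_j^{-1}p(j|Wx)Wxx^T\Big\}, \tag{16a}$$

$$\alpha_j^{new} = \alpha_j^{old} + \eta p(j|Wx), \quad \Lambda_j^{new} = \Lambda_j^{old} + \eta p(j|Wx)Wxx^TW^T. \tag{16b}$$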
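
The posterior weights of eq.(15b) and the batch quantities of eq.(15c) can be sketched for the gaussian mixture case as follows. This is a minimal illustration assuming NumPy, zero-mean gaussian components with diagonal covariances, and sample averages in place of expectations; the function names and the small example are assumptions, and the log-domain normalization is a numerical convenience rather than part of the original derivation.

```python
import numpy as np


def posterior_weights(Y, alphas, lambdas):
    """p(j|y) for the mixture sum_j alpha_j G(y, 0, diag(lambdas[j])).

    Y has shape (N, n); alphas has shape (q,); lambdas has shape (q, n) and
    holds the diagonal variances of each component. Returns an (N, q) array."""
    Y = np.asarray(Y, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    lambdas = np.asarray(lambdas, dtype=float)
    # log G(y, 0, Lambda_j) for every sample and component
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * lambdas), axis=1)   # (q,)
    log_quad = -0.5 * (Y ** 2) @ (1.0 / lambdas).T                    # (N, q)
    log_w = np.log(alphas)[None, :] + log_norm[None, :] + log_quad
    log_w -= log_w.max(axis=1, keepdims=True)   # stabilise before exponentiating
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)


def batch_mixture_update(Y, post):
    """Sample versions of alpha_j = E p(j|y) and Lambda_j = E[p(j|y) y y^T], y = W x."""
    N = Y.shape[0]
    alphas = post.mean(axis=0)
    Lambdas = [(Y * post[:, [j]]).T @ Y / N for j in range(post.shape[1])]
    return alphas, Lambdas


# Tiny example: one point near the origin, one far away.
Y_demo = np.array([[0.0, 0.0], [5.0, 5.0]])
post_demo = posterior_weights(Y_demo, [0.5, 0.5], [[0.1, 0.1], [10.0, 10.0]])
alphas_demo, Lambdas_demo = batch_mixture_update(Y_demo, post_demo)
```

In the example, the point at the origin is assigned almost entirely to the narrow component and the distant point to the wide one.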

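The second $Q$ family is called the fourth-order exponential-polynomial family:

$$\rho(s) = \sum_{i=1}^{n}\rho_i(s_i) = -\tfrac{1}{2}s^T\Lambda_2^{-1}s + \tfrac{1}{3}(s^2)^T\Lambda_3^{-1}s - \tfrac{1}{4}(s^3)^T\Lambda_4^{-1}s,$$

$$g(s,\theta) = c^{-1}\exp(\rho(s)), \quad s^k = [s_1^k, \cdots, s_n^k]^T, \ k = 2, 3, \tag{17}$$

where $\Lambda_k$, $k = 2, 3, 4$ are diagonal matrices, which are prefixed such that the
following condition holds:

$$c = \int_s \exp(\rho(s))\,ds < \infty, \quad \int_s s\,g(s,\theta)\,ds = 0.$$

Thus, from eq.(12), we can get the algorithm for updating $W$:

$$\Delta W = -\eta\{\Delta_{W,\sigma^2} - (WW^T)^{-1}W + E[\{\Lambda_2^{-1}Wx - \Lambda_3^{-1}(Wx)^2 + \Lambda_4^{-1}(Wx)^3\}x^T]\}. \tag{18}$$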

Also,  from  eq.(lOa)  we  have  that  for  A;  =  2,3, 4 

“  J'^£f(5,0)s*s^rfs  +  S*S^]A~E 

Unfortunately, the gradient is difficult to use since $\int_s g(s,\theta)s^k s^T ds$ is difficult
to compute, although it is well defined. So, $\Lambda_k$, $k = 2, 3, 4$, are prefixed as above.


5.  BKYY  ICA  in  Two  Special  Cases 

The first special case is that there is no noise, i.e., $x = A_o s^o$ with $e_x = 0$.
The algorithms given in Sec.4 still work here with an interesting property.

When $m > n$, from the fact $(m - n)\sigma^2 = E\|x - W^{-}Wx\|^2 = tr[E(s^o s^{oT})A_o^T(I -
W^{-}W)^T(I - W^{-}W)A_o]$, and the fact that $E(s^o s^{oT})$ is diagonal, we have that $\sigma^2 = 0$
if and only if $tr[A_o^T(I - W^{-}W)^T(I - W^{-}W)A_o] = 0$ or equivalently $(I - W^{-}W)A_o =
(I - W^T(WW^T)^{-1}W)A_o = 0$, which is true if and only if the space spanned by $A_o$ is
contained in the space spanned by $W^T$. In other words, $A_o = W^T B$ with $n \ge n_o$ and $B$
being an $n \times n_o$ matrix. In this case from eq.(9), we know $\sigma^2 = 0$ and
$WR_x - WR_x W^T(WW^T)^{-1}W = WA_o E(s^o s^{oT})B^T W - WA_o E(s^o s^{oT})B^T W = 0$.
Thus, in eq.(9) we have $J_s(W, \sigma^2, n, \theta) = -\infty$, and in eq.(11)
and also in the subsequent related equations we have
$[WR_x - WR_x W^T(WW^T)^{-1}W]/\sigma^2$ undefined. Apparently, our method cannot work
normally here.

In fact, this is just an interesting property for quickly detecting the number
$n_o$ of sources in the situation without noise. In other words, for each
$n$ we can run any of our algorithms in Sec.4 until either normal convergence
or $\sigma^2 = 0$, and then we can get the correct $n_o$ by simply increasing $n$ from the
lower side towards $n_o$ until $\sigma^2$ becomes zero (or very small), or by decreasing $n$
from the upper side towards $n_o$ until $\sigma^2$ is no longer regarded as zero.
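
The detection rule above depends only on whether the row space of $W$ contains the space spanned by $A_o$. The sketch below illustrates it under a loudly stated substitution: instead of running the algorithms of Sec.4 for each $n$, it uses the top $n$ principal directions of the data as a stand-in for $W$, which spans the same subspace once $n \ge n_o$ in the noise-free case. NumPy, the function names, the tolerance and the synthetic data are all assumptions.

```python
import numpy as np


def residual_sigma2(X, W):
    """sigma^2 = E||x - W^- W x||^2 / (m - n) with W^- = W^T (W W^T)^{-1}.

    X has shape (N, m), one observation per row; W has shape (n, m) with n < m."""
    n, m = W.shape
    W_pinv = W.T @ np.linalg.inv(W @ W.T)
    P = W_pinv @ W                        # symmetric projector onto the row space of W
    resid = X - X @ P
    return float(np.mean(np.sum(resid ** 2, axis=1)) / (m - n))


def detect_num_sources(X, tol=1e-10):
    """Increase n from 1 until sigma^2 is regarded as zero (noise-free case only)."""
    N, m = X.shape
    R = X.T @ X / N
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    for n in range(1, m):
        W = eigvecs[:, order[:n]].T       # stand-in for a learned W (assumption)
        if residual_sigma2(X, W) < tol:
            return n
    return m


# Synthetic noise-free data: 2 sources observed by 5 sensors.
rng = np.random.default_rng(0)
S = rng.uniform(-1.0, 1.0, size=(2000, 2))
A_o = rng.normal(size=(5, 2))
X = S @ A_o.T
n_detected = detect_num_sources(X)
```

With noisy sensors the residual never reaches zero, so this shortcut no longer applies and the criterion $J(n)$ of eq.(9) would be needed instead.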

After we get this $n_o$, we can simply drop the term $(m - n)\ln\sigma^2$ in eq.(9).
Moreover, since $\sigma^2 = 0$, $J_G(x, W, \sigma^2, \theta) = -\ln g(Wx,\theta)$ by eq.(10a) is exact, without
any approximation. Thus, eq.(9) becomes

$$J_s(W,\theta) = -0.5\ln|WW^T| - E\ln g(Wx,\theta), \quad \{W^*, \theta^*\} = \arg\min_{\{W, \theta \in \Theta\}} J_s(W,\theta), \tag{19}$$

and eq.(12a) becomes the following eq.(20), with eq.(12b) still used for $\theta$:

$$\Delta W = \eta[(WW^T)^{-1}W + E[h(Wx,\theta)x^T]], \quad \Delta W = \eta[(WW^T)^{-1}W + h(Wx,\theta)x^T]. \tag{20}$$
Also, for the $Q$ given by eq.(13a), we have that eq.(15a) and eq.(16a) become the
following eq.(21), with eq.(15c) and eq.(16b) still used for $\theta$, respectively:

$$\Delta W = \eta\Big\{(WW^T)^{-1}W - \sum_{j=1}^{q}\Lambda_j^{-1}WE[p(j|Wx)xx^T]\Big\};$$

$$\Delta W = \eta\Big[(WW^T)^{-1}W - \sum_{j=1}^{q}\Lambda_j^{-1}Wp(j|Wx)xx^T\Big]. \tag{21}$$

Furthermore, for the $Q$ given by eq.(17), it follows that eq.(18) becomes:

$$\Delta W = \eta\{(WW^T)^{-1}W - E[\{\Lambda_2^{-1}Wx - \Lambda_3^{-1}(Wx)^2 + \Lambda_4^{-1}(Wx)^3\}x^T]\}. \tag{22}$$

In the above first special case, if we further let $m = n = n_o$, we get the second
special case. In fact, it is the case given by eq.(1) that has been widely assumed in
the literature. In this case, $W$, $A_o$ are both $n_o \times n_o$ nonsingular matrices. Thus, we
have $|WW^T| = |W|^2$ and eq.(19) further becomes

$$J_s(W,\theta) = -\ln|W| - E\ln g(Wx,\theta), \quad \{W^*, \theta^*\} = \arg\min_{\{W, \theta \in \Theta\}} J_s(W,\theta), \tag{23}$$

which is exactly the so-called information-theoretic approach or maximum likelihood
method [1,2,3,4]. Moreover, we have that eq.(20) becomes

$$\Delta W = \eta[W^{-T} + E(h(Wx,\theta)x^T)], \quad \Delta W = \eta[W^{-T} + h(Wx,\theta)x^T]. \tag{24a}$$

When $h(Wx,\theta)$ is pre-fixed, it is exactly the INFORMAX algorithm by [4]. Via modifying
$\Delta W$ by $W\Delta WW^T$, eq.(24a) becomes the natural gradient algorithm for MMI [3]:

$$\Delta W = \eta[W + WE(h(Wx,\theta)(Wx)^T)], \quad \Delta W = \eta[W + Wh(Wx,\theta)(Wx)^T]. \tag{24b}$$

In [3], $g(Wx,\theta)$ (thus $h(Wx,\theta)$) is pre-fixed via a truncated Gram-Charlier series.
Furthermore, by using eq.(24a) together with eq.(12b) for learning $g(Wx,\theta)$ via updating
$\theta$, with $g(Wx,\theta)$ defined by eq.(13a) and $p(s_i|\lambda_{j,i})$ being the derivative of a simple
sigmoid function, we get the so-called Learned Mixture of Parametric Densities for the
information-theoretic approach [10,11].
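
A batch update of the natural-gradient kind can be sketched in a few lines. The version below is an assumption-laden illustration, not the paper's algorithm as printed: it uses the commonly used form $\Delta W = \eta(I + E[h(y)y^T])W$ with $y = Wx$, which places $W$ on the right and so may differ from eq.(24b) in the placement of $W$ while sharing the same stationary condition $E[h(y)y^T] = -I$; the fixed cubic choice $h(y) = -y^3$, the stepsize, the iteration count and the synthetic uniform sources are likewise illustrative choices.

```python
import numpy as np


def natural_gradient_ica(X, eta=0.1, n_iter=2000):
    """Batch natural-gradient ICA with the fixed nonlinearity h(y) = -y**3.

    X has shape (N, n), one (approximately zero-mean) observation per row.
    Returns an (n, n) de-mixing matrix W."""
    N, n = X.shape
    W = np.eye(n)
    for _ in range(n_iter):
        Y = X @ W.T                       # y = W x for every sample
        H = -Y ** 3                       # h(y), suited to sub-gaussian sources
        corr = H.T @ Y / N                # sample estimate of E[h(y) y^T]
        W = W + eta * (np.eye(n) + corr) @ W
    return W


# Two sub-gaussian (uniform) sources mixed by a hypothetical matrix A_o.
rng = np.random.default_rng(0)
S = rng.uniform(-1.0, 1.0, size=(5000, 2))
A_o = np.array([[1.0, 0.5], [0.3, 1.0]])
X = S @ A_o.T

W = natural_gradient_ica(X)
V = W @ A_o                               # should be close to a scaled permutation
```

Inspecting `V` shows one dominant entry per row, which is the scaled-permutation structure required by eq.(1). With a pre-fixed cubic nonlinearity this is expected only for sub-gaussian sources.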

In the rest of this section, we further propose two algorithms. The first
one is obtained via modifying $\Delta W$ by $W\Delta WW^T$ in eq.(21) and using it together with
eq.(15c) for updating $\theta$. That is, we have

$$\Delta W = \eta\Big\{W - W\sum_{j=1}^{q}\Lambda_j^{-1}E[p(j|Wx)(Wx)(Wx)^T]\Big\}, \tag{25a}$$

$$\alpha_j = Ep(j|Wx), \quad \Lambda_j = WE[p(j|Wx)xx^T]W^T. \tag{25b}$$

$$\Delta W = \eta\Big[W - W\sum_{j=1}^{q}\Lambda_j^{-1}p(j|Wx)(Wx)(Wx)^T\Big], \tag{25c}$$

$$\alpha_j^{new} = \alpha_j^{old} + \eta p(j|Wx), \quad \Lambda_j^{new} = \Lambda_j^{old} + \eta p(j|Wx)Wxx^TW^T, \tag{25d}$$

which is a new Learned Gaussian Mixture algorithm for the problem eq.(1).
Its advantages are shown in the following two theorems.

Theorem 1 For the problem eq.(1) and using the batch algorithm eqs.(25a&b)
with $q = 2$ only, given the converged nonsingular $W^*$ and other parameters
$p^*(j|W^*x)$, $\Lambda_j^*$ as well as $V^* = W^*A_o$, denote $R_x^j = E[p^*(j|W^*x)xx^T]$ and
$Rank[R_x^1 - R_x^2] = k_r$. Then, we have that $V^* = \Pi D$ as long as $k_r \ge n_o - 1$, where
$\Pi$ is a permutation matrix and $D$ is a non-singular diagonal matrix. Moreover,
when $k_r < n_o - 1$, there are $k_r$ row vectors of $V^*$ that are linearly independent and
each of them contains only one nonzero element.

Proof From eq.(25b), after convergence we have $W^*R_x^jW^{*T} = \Lambda_j$, $j = 1, 2$, with $\Lambda_j$
being a positive diagonal matrix. From $R_x^j = A_oR_s^jA_o^T$ with $A_o$ being full rank, we
know $Rank[R_x^1 - R_x^2] = Rank[R_s^1 - R_s^2] = k_r$. Also, since $R_s^j = E[p^*(j|W^*A_os)ss^T]$
should be a positive diagonal matrix too, we know that $R_s^1 - R_s^2$ has $k_r$ nonzero diagonal
elements, which means that there are $k_r$ corresponding different diagonal elements between
$R_s^1$ and $R_s^2$, that is, $R_s^2(R_s^1)^{-1}$ has $k_r$ different diagonal values. Furthermore, from $\Lambda_j =
W^*A_oR_s^j(W^*A_o)^T = V^*R_s^jV^{*T}$, we have $R_s^1V^{*T} = V^{*-1}\Lambda_1$, and $V^{*-1}\Lambda_2 = R_s^2V^{*T} =
R_s^2(R_s^1)^{-1}R_s^1V^{*T} = R_s^2(R_s^1)^{-1}V^{*-1}\Lambda_1$. Denoting $R = R_s^2(R_s^1)^{-1}$ and $\Lambda = \Lambda_2\Lambda_1^{-1}$,
which are both diagonal, we have $RV^{*T} = V^{*T}\Lambda$. For the $j$-th column vector $v_j$ of $V^{*T}$, or
the $j$-th row vector of $V^*$, we have $Rv_j = \lambda_jv_j$ with $\lambda_j$ being the $j$-th diagonal element
of $\Lambda$. For those $k_r$ different diagonal elements of $R$, we have that the $k_r$ corresponding
vectors $v_j$ are linearly independent with each containing only one nonzero element. When
$n_o - k_r = 1$, the remaining $v_j$ should also contain only one nonzero element. Therefore,
we have $V^* = \Pi D$ as long as the rank $k_r \ge n_o - 1$. When $k_r < n_o - 1$, since there are
$n_o - k_r$ diagonal elements of $R$ that are 1, their corresponding $n_o - k_r$ vectors $v_j$ can be
linearly independent but each may contain more than one nonzero element. Q.E.D.

According to this theorem, when all the sources in $s$ are gaussian, we have the learned
$R_x^1 = R_x^2$ and $k_r = 0$, and the above Learned Gaussian Mixture algorithm will fail. When
$n_o - k_r$ sources in $s$ are gaussian, then $Rank[R_x^1 - R_x^2] = k_r$ and the algorithm cannot recover the
$n_o - k_r$ gaussian sources, while the other $k_r$ sources can be recovered. If there is at most one
gaussian source in $s$, it usually will lead to $Rank[R_x^1 - R_x^2] = k_r \ge n_o - 1$ and we have
$V^* = \Pi D$. This theorem has justified the experimental successes given in [10,11].

This theorem has also provided an implementable way of checking whether the
obtained result is fully successful, partially successful, or a failure, through checking
$Rank[R_x^1 - R_x^2] = k_r$. In practice, numerical error or other factors may affect the accuracy
of the algorithm and the estimate of $Rank[R_x^1 - R_x^2] = k_r$. For reliability, we can use a
gaussian mixture with $q > 2$.
In this case, we can get the following Theorem 2 based on the above theorem.
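
The rank check suggested by Theorem 1 needs a numerical notion of rank, because estimated matrices are never exactly rank deficient. The sketch below is one possible way to do it, assuming NumPy; the function names, the relative tolerance `rel_tol` and the scale used to set the threshold are illustrative assumptions, and a suitable tolerance will depend on sample size and noise.

```python
import numpy as np


def weighted_correlation(X, post_j):
    """Sample estimate of R_x^j = E[p(j|Wx) x x^T].

    X has shape (N, m); post_j holds the N posterior weights p(j|Wx)."""
    X = np.asarray(X, dtype=float)
    post_j = np.asarray(post_j, dtype=float)
    return (X * post_j[:, None]).T @ X / X.shape[0]


def effective_rank(R1, R2, rel_tol=1e-3):
    """Number of singular values of R1 - R2 above rel_tol times the overall scale."""
    scale = np.linalg.norm(R1, 2) + np.linalg.norm(R2, 2)
    if scale == 0.0:
        return 0
    s = np.linalg.svd(R1 - R2, compute_uv=False)
    return int(np.sum(s > rel_tol * scale))


# Example: two matrices sharing eigenvectors and differing in one eigenvalue only.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R1 = Q @ np.diag([1.0, 2.0, 3.0]) @ Q.T
R2 = Q @ np.diag([1.0, 2.5, 3.0]) @ Q.T
k_r = effective_rank(R1, R2)
```

Comparing `k_r` with $n_o - 1$ then indicates, in the sense of Theorem 1, whether full or only partial recovery should be expected.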

Theorem 2 For the problem eq.(1) and using the batch algorithm eqs.(25a&b)
with $q > 2$, given the converged nonsingular $W^*$ and other parameters
$p^*(j|W^*x)$, $\Lambda_j^*$ as well as $V^* = W^*A_o$, denote $R_x^j = E[p^*(j|W^*x)xx^T]$ and let $k_r$ be the
rank of $S$ with $S = \bigcup_{i \ne j} S_{ij}$ and $S_{ij}$ being the subspace spanned by the column
vectors of $R_x^i - R_x^j$. Then, we have that $V^* = \Pi D$ as long as $k_r \ge n_o - 1$, where
$\Pi$ is a permutation matrix and $D$ is a non-singular diagonal matrix. Moreover,
when $k_r < n_o - 1$, there are $k_r$ row vectors of $V^*$ that are linearly independent and
each of them contains only one nonzero element.

Theorem 2 suggests that as long as $q$ is large enough, we can fully succeed if there is at
most one gaussian source in $s$; otherwise, we can still recover those non-gaussian sources.
Next, we give the second algorithm for the problem eq.(1), which is
obtained by modifying $\Delta W$ by $W\Delta WW^T$ in eq.(18):

$$\Delta W = \eta\{W - WE[\{\Lambda_2^{-1}Wx - \Lambda_3^{-1}(Wx)^2 + \Lambda_4^{-1}(Wx)^3\}(Wx)^T]\}, \tag{26}$$

for which we have
Theorem  3  For  the  problem  eq.(l)  and  assuming  that  the  batch  algorithm 
eqs.(26)  converged  with  nonsingular  W*  and  V*  =  W*Ao.  Denote  R{  = 
<itagf[£;{s{),  ■  • '  ,JS(s^)l,i  =  2,3,4.  Then,  we  have  V*  =  UD  as  long  as  is 
full  rank  and  of  faiagih^^^ F[{W* x)^{W* x)'^]}  ^  offdiag{hjE[{W*xy{W*x)'^]}  unless 
offciiag{E[iW*x)^{W*x)'^]}  =  ofU,g{E[{W*xy{W*x)'^]}  =  0,  where  j^ke  [2,3,4] 
and  of  fdiagi^]  denotes  all  the  off-diagonal  elements  of  A. 

Proof  From  eq.(26),  after  converged  we  have  /  =  A2^W* Exx'^W*'^  — 

A~^ E[{W*xy{W*x)'^]  +  A“^£J[(W*a;)^)(VFa:)^],  and  under  the  condition  of  the  theo¬ 
rem,  we  have  E[{W*xy{W*x)'^]  =  D^,  for  k  =  2,3,4  with  Dk  being  diagonal.  First, 
from  E[{W* Ao)ss'^ {W*Ao)'^]  =  V*R^V*'^  =  D2  with  R\  being  a  positive  definite 
diagonal  matrix,  and  we  have  V*  =  with  =  /.  Second,  from 

E[{W*xy{W*x)'^]  =  E[{V*sy{V*x)'^]  =  D3,  we  have  ^  ^ 

diagonal  and  =  [v^j]  (i.e.,  its  each  element  is  the  square  of  ofV  =  [vij]).  Moreover, 
with  V*  =  we  have  =  [v^j]  =  Dl[(f)^^j](Rl)~^  with  0“^  =  [(j)ij] 

and  =  /  as  given  above.  Therefore  =  D, 

or  [4>‘^  ^]R^  =  D'  with  R  and  D'  being  both  diagonal.  It  further  follows  that  <!•  = 
j]~^  D'  and  /  =  =  D'[(f)‘^j]~'^  R~'^[(f)fj]'~^  D'  or  by  inverting  it  becomesD  ^  = 

j]^ ^  which  means  [<f>^ j]  =  D'^R~^  with  =  I.  Thus,  D'  —  [0?^]//$  = 

D'<SR~^R^  or  =  /,  That  is,  $  =  and  [(f>^j]R  =  D'^'^  =  D'[4>i,j],  which  means 
that  rt(iy  j  =  thus  either  <t>ij  =  0  or  4>ij  —  d^/vi,  with  n  and  d'-  being  the  diagonal 

element  of  R,  D'  respectively.  Therefore,  =  RED*  and  each  element  of  E  can  only  to 
be  0  or  1.  It  follows  from  /  =  =  RED'^E'^R  that  R-^  =  ED'^E'^  =  d^'^aej 

with  ei  is  the  column  of  E.  Thus,  the  off-diagonal  elements  of  eief  must  be  zeros,  or 
equivalently  E  is  a  permutation  matrix  =  EERED*  =  EDr  with  DrERED'  being 
still  diagonal.  QED. 

The above theorem suggests a way to improve the algorithm eq.(26) via imposing some
constraints to enhance the chance of satisfying the condition of the theorem. Due to the
limited space, we leave the details elsewhere.

Not only have the above algorithms given by eq.(25) and eq.(26) extended the
existing information-theoretic approaches [1,2,3,4], but the algorithms given
by eq.(20), eq.(21) and eq.(22) can also be regarded as extended
information-theoretic approaches for the cases in which the number of sensors is larger
than the unknown number of sources. Finally, all these approaches are special cases of
the BKYY ICA given in Sec.3 and Sec.4 for the much more general cases given by
eq.(5). Furthermore, we can also prove theorems similar to Theorems
1, 2, 3. In addition, we can also prove that the obtained $n^*$ will be the true $n_o$ for
the algorithms in Sec.4. Due to limited space, we leave these proofs to [14].

6.  Bayesian  Convex  Divergence  (BCYY)  ICA  Scheme 

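We go back to consider eq.(7b) by using the convex divergence eq.(3). Due
to limited space, we leave the general case elsewhere; here we only consider the
special case $n = m = n_o$ and $\sigma^2 = 0$. In this case, we have $y = s$ and
$p_{M_{s|y}}(s|y) = \delta(s - y) = \delta(y - s) = p_{M_{y|s}}(y|s)$. Thus, eq.(7b) becomes

$$\{W^*, \{g_i^*\}\} = \arg\min_{\{W, \{g_i\} \in Q\}} J(W, \{g_i\}),$$

$$J(W, \{g_i\}) = f(1) - \int_x p_h(x) f\Big(\frac{|\det W|\prod_{i=1}^{n} g_i(x^Tw_i)}{p_h(x)}\Big)dx, \quad W^T = [w_1, \cdots, w_n]. \tag{27}$$

Through the density transformation $p(y) = p_h(x)/|\det W|$, we can also get

$$J(W, \{g_i\}) = f(1) - \int_y p(y) f\Big(\frac{\prod_{i=1}^{n} g_i(y_i)}{p(y)}\Big)dy, \tag{28a}$$

which can be regarded as a generalized version of the MMI [3]. Moreover, we
transform it further to $z = s(y)$ by a sigmoid monotonic function $s_i(r) = \int_{-\infty}^{r} g_i(y_i)dy_i$,
resulting in

$$J(W) = f(1) - \int_z p(z) f\Big(\frac{1}{p(z)}\Big)dz. \tag{28b}$$

That is, we have reached a generalized version of the INFORMAX [4].

We consider the special case $f(u) = u^{\beta}$, $0 < \beta < 1$. From eq.(27), we have

$$J(W, \{g_i\}) = f(1) - |\det W|^{\beta}\int_x p_h^{1-\beta}(x)\Big[\prod_{i=1}^{n} g_i(x^Tw_i)\Big]^{\beta}dx. \tag{29}$$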

From  we  get  both  the  batch  way  and  adaptive  algorithms  for 

updating  W  by 

AW  =  -r7/?|det  W!^{F[p-^(a:)(nLi  +  k{Wx){Wx)'^)]W}, 

AW=  -0\detW\l^{fl''^^^g^{x*wi))l^iW  h{Wx){Wx)'^W),  (30) 

where h is the same as in eq.(10a). We have removed in the adaptive
equation to save the computation on $p_h(x)$. It is interesting to compare eq.(30) with
eq.(24b). We find that the moving step is modulated by a scalar $C(|\det W|, y, \beta)$ =
$\beta$(|det W| • This change has at least one effect. Since $g_i(x^T w_i)$ is
small when is large, $C(|\det W|, y, \beta)$ becomes relatively small for
large In other words, the algorithm eq.(30) should be more robust to the
effects of outliers.

To verify the effect, we use the fixed $h_i(s_i) = -s_i^3$ as in [10], where it is shown that
the algorithm eq.(24b) can succeed on sub-gaussian sources. In Fig. 2, the 1st
row shows the result for a problem of 2 channels with sources from a sub-gaussian uniform(-1,1)
signal plus 5% outliers. The 2nd row shows the result for 2 channels with sources
from a sub-gaussian beta(0.5,0.5) signal plus 5% outliers. The 3rd row shows the result
for a problem of 3 channels with sources: one from the above uniform(-1,1) with outliers,
one from the above beta(0.5,0.5) with outliers, and one from the super-gaussian permuted
speech signal. Their histograms are listed in the 1st column from top down. The
2nd and 3rd columns show the results of the algorithms eq.(30) and eq.(24b), respectively.

For the first two experiments (the first two rows), the cubic nonlinearity $h_i(s_i) = -s_i^3$
is used. The algorithm eq.(24b) given by [3,4] fails because the two channels of sources are
actually super-gaussian now. The results are consistent with the existing theoretical results
[15] that it cannot perform separation for super-gaussian signals. However, interestingly,
the algorithm eq.(30) with $\beta = 0.5$ succeeds. Trials with $\beta = 0.8$, 0.2 also show similar
results. Hence, it is demonstrated that the algorithm eq.(30) is indeed more robust than
the algorithm eq.(24b) to outliers. The third experiment (the 3rd row) used the learned
mixture of densities as used in [10,11] with $g_i(s_i)$ learned. Now both the BKYY ICA and
BCYY ICA algorithms work well.

Figure 1 The experimental comparisons on several types of source signals
(columns: Histogram, New Algorithm, Old Algorithm)


Abstract

The  problem  of  adapting  linear  Multi-Input-Multi-Output 
systems  for  unsupervised  separation  of  linear  mixtures  of 
sources  arises  in  a  number  of  signal  processing  applications. 
In  this  paper  we  present  a  new  single  layer  neural  network 
in  which  information  transfer  maximization  is  equivalent  to 
minimizing  a  cost  function  involving  the  well-known  Constant 
Modulus  criterion  originally  used  in  blind  equalization.  The 
proposed  approach  is  able  to  separate  sources  with  negative 
kurtosis as revealed by an analysis of the cost function stationary
points. Two learning rules are presented to compute
the  optimum  separating  matrix.  One  of  them  turns  out  to  be 
an  equivariant  algorithm  whose  convergence  does  not  depend 
on  the  mixture  matrix. 


1  Problem  Statement 

Adapting linear Multi-Input Multi-Output (MIMO) systems to separate linear
mixtures of signals is a problem that frequently arises in signal processing
applications such as array processing, multiuser detection, linear feature
extraction, etc. The blind source separation problem can be formulated as
follows. Let us consider an array of sensors that provides a vector of
observations $\mathbf{x} = [x_1, x_2, \cdots, x_N]^T$ which is a linear mixture of a vector of sources
$\mathbf{s} = [s_1, s_2, \cdots, s_N]^T$

$$\mathbf{x} = \mathbf{A}\mathbf{s} \qquad (1)$$

A represents the $N \times N$ mixture matrix. Both the sources and the mixture
matrix are unknown. The only assumptions we will make in our model are that
A is a non-singular matrix and the sources are zero-mean, statistically
independent, non-gaussian random processes. Without loss of generality, we
can assume that sources have unit variance since power differences can be
incorporated in matrix A.

*This work has been supported by Xunta de Galicia (grant XUGA 10502A96) and
CICYT (grant TIC 96-0500-C10-02)

To recover the sources, x is processed through a linear memoryless MIMO
system, represented by an $N \times N$ matrix W, to produce an output vector
$\mathbf{y} = [y_1, y_2, \cdots, y_N]^T$

$$\mathbf{y} = \mathbf{W}^T\mathbf{x} \qquad (2)$$

The superscript $T$ denotes transpose. Combining (1) and (2),
we get

$$\mathbf{y} = \mathbf{G}\mathbf{s} \qquad (3)$$

where $\mathbf{G} = \mathbf{W}^T\mathbf{A}$ is the matrix representing the overall mixing/separating
system. The objective in source separation is to select W so that each
output corresponds to a single and different source $s_i$ up to some fixed gain.
When this occurs, G can be expressed as the product of a diagonal matrix
$\boldsymbol{\Lambda}$ and a permutation matrix P, i.e., $\mathbf{G} = \boldsymbol{\Lambda}\mathbf{P}$.

A basic principle to solve the blind source separation problem is provided
by the Darmois-Skitovich theorem [1]: if s is a vector of statistically
independent non-gaussian signals and $\mathbf{y} = \mathbf{G}\mathbf{s}$, then y is a vector of statistically
independent signals if and only if $\mathbf{G} = \boldsymbol{\Lambda}\mathbf{P}$. Therefore, one way to recover
the sources is to select W in order to minimize the statistical dependence
among the components in y. This is referred to in the literature as Independent
Component Analysis (ICA) [2], and a number of both block processing [2] and
adaptive processing [3, 4, 5, 6] methods have been proposed. In the sequel,
we will focus our attention on adaptive methods since they are easier to
implement and more related to the field of neural networks.

Recently, several adaptive algorithms for blind source separation have
been developed in the context of unsupervised learning of neural networks
[7]. The separating MIMO system is then interpreted as the linear part of a
single layer nonlinear neural network (see figure 1). In this model, matrix W
represents the synaptic weights and $g(\cdot)$ the activation function. The vector
of the outputs after the nonlinearities $u_i = g(y_i)$, $i = 1, \cdots, N$ is denoted u.


Figure 1: The model. (Diagram labels: mixture stage, separating stage,
single layer non-linear neural network.)

In an attempt to understand the way perceptual systems work, an unsupervised
learning paradigm called the infomax principle [8] has been proposed.
According to this principle, the parameters of a neural network should be
chosen to maximize the information transfer between the input and the
output. Nadal and Parga [9] have shown that if the activation functions are
continuous, increasing, invertible and bounded nonlinearities, information
transfer between x and u is maximized when the components in u are
statistically independent and have a uniform distribution. This result suggests
that statistical dependence between outputs tends to decrease when maximizing
information transfer. However, this is not always true, since simulations
reported in [7] show that information transfer maximization algorithms only
perform blind source separation when the sources have positive kurtosis. In
many applications (communications and image processing, for instance)
signals typically have negative kurtosis and the algorithms in [7] do not work
adequately.

In this paper we present a new information transfer maximization
algorithm suitable for sources with negative kurtosis. The algorithm can also be
interpreted as a generalization of the Constant Modulus Algorithm (CMA)
[10], therefore reinforcing the link between information-theoretic
unsupervised learning paradigms and blind adaptive filtering. Section 2 presents
the information transfer maximization criterion. Section 3 introduces two
learning rules. Section 4 presents some simulation experiments and section 5
contains the conclusions.


2  Information  Transfer  Maximization 


Let us start by considering a single layer nonlinear neural network comprising
a linear part and a new fixed activation function defined as follows

$$g(x) = \int_{-\infty}^{x} \exp\left(-(t^2 - 1)^2\right) dt \qquad (4)$$

Figure 2 shows the plot of $g(x)$ and, similarly to other well-known activation
functions, it is a continuous, increasing, bounded and invertible nonlinearity.
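The properties claimed for the activation function in (4) can be checked numerically. The following is a minimal sketch, assuming the integrand $\exp(-(t^2-1)^2)$ that is implied by the cost function derived later in this section; the grid limits, grid size and the helper name `activation_on_grid` are illustrative choices, not part of the original work.

```python
import numpy as np

def activation_on_grid(t_min=-6.0, t_max=6.0, num=4001):
    """Tabulate g(x) = integral_{-inf}^{x} exp(-(t^2 - 1)^2) dt on a grid.

    The lower limit is truncated at t_min, where the integrand is
    numerically zero. Grid limits and size are illustrative assumptions.
    """
    t = np.linspace(t_min, t_max, num)
    slope = np.exp(-(t**2 - 1.0) ** 2)   # g'(t); never exceeds 1
    dt = t[1] - t[0]
    # Cumulative trapezoidal rule: g[k] approximates the integral up to t[k].
    g = np.concatenate(([0.0], np.cumsum(0.5 * (slope[1:] + slope[:-1]) * dt)))
    return t, g, slope

t, g, slope = activation_on_grid()
```

Because the derivative is non-negative everywhere and decays very quickly for $|t| > 2$, the tabulated `g` is non-decreasing and saturates, which is the numerical counterpart of "increasing and bounded".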

Following the infomax principle [8], we select the synaptic weights W
in order to maximize the information transfer between the input x and the
output after the nonlinearity u, that is

$$I(\mathbf{x}, \mathbf{u}) = E\left\{ \ln \frac{p_{x,u}(\mathbf{x}, \mathbf{u})}{p_x(\mathbf{x})\, p_u(\mathbf{u})} \right\} \qquad (5)$$

where $E\{\cdot\}$ denotes expectation, and $p_x(\mathbf{x})$, $p_u(\mathbf{u})$ and $p_{x,u}(\mathbf{x}, \mathbf{u})$ are the
probability density functions (p.d.f.) of x, u and the pair (x, u),
respectively. Taking into account that the entropy of the output u is
$H(\mathbf{u}) = E\{-\ln p_u(\mathbf{u})\}$ and that the entropy of u conditioned on the input x is
$H(\mathbf{u}|\mathbf{x}) = E\{-\ln p_{u|x}(\mathbf{u}|\mathbf{x})\}$, (5) is equivalent to

$$I(\mathbf{x}, \mathbf{u}) = H(\mathbf{u}) - H(\mathbf{u}|\mathbf{x}) \qquad (6)$$

Figure 2: Nonlinear transfer function.


Now, since the relationship between u and x is deterministic, $H(\mathbf{u}|\mathbf{x}) = 0$.
Therefore, maximizing the information transfer is equivalent to maximizing
the output entropy [9].

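Next, we will express $H(\mathbf{u})$ in terms of the input entropy. Provided that
W is a square and invertible matrix and the nonlinearity in (4) is invertible,
u has a p.d.f. given by

$$p_u(\mathbf{u}) = \frac{p_x(\mathbf{x})}{|\det \mathbf{W}| \prod_{i=1}^{N} g'(y_i)} \qquad (7)$$

where $\mathbf{y} = \mathbf{W}^T\mathbf{x}$. Therefore,

$$H(\mathbf{u}) = H(\mathbf{x}) + \sum_{i=1}^{N} E\{\ln g'(y_i)\} + \ln |\det \mathbf{W}| \qquad (8)$$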

where $H(\mathbf{x})$ is the input entropy. Particularizing for the nonlinearity (4)

$$H(\mathbf{u}) = H(\mathbf{x}) - \sum_{i=1}^{N} E\{(y_i^2 - 1)^2\} + \ln |\det \mathbf{W}| \qquad (9)$$

Since the input entropy $H(\mathbf{x})$ does not depend on W, we conclude that
maximizing $H(\mathbf{u})$ (or equivalently maximizing the information transfer) is
equivalent to minimizing the cost function

$$\phi(\mathbf{W}) = \sum_{i=1}^{N} E\{(y_i^2 - 1)^2\} - \ln |\det \mathbf{W}| \qquad (10)$$

This is an important result because this cost function also admits another
interesting interpretation from the perspective of blind adaptive filtering.
The first part of (10) is the extension to MIMO systems of the well-known
Constant Modulus (CM) criterion [10] widely used in blind equalization. The
analysis carried out in [11] for a Multiple-Input-Single-Output (MISO) system
whose output is $y_i = \mathbf{w}_i^T\mathbf{x}$ shows that if the kurtosis of all the sources is
negative, the only existing minima of the criterion $E\{(y_i^2 - 1)^2\}$ correspond
to points where a single source is extracted. This means that if a MIMO
system is adjusted according to the first part of (10), each output $y_i$ will
extract a single source. However, there exists the possibility that the same
source is extracted at different outputs. This situation is prevented by the
second term in (10) because, when it occurs, two columns of
W are proportional, $\det \mathbf{W}$ tends to zero and the second part of (10) grows very large.
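The role of the two terms in (10) can be illustrated with a sample estimate of the cost. This is a minimal sketch under stated assumptions: the mixing matrix, the uniform sources and the function name `cm_infomax_cost` are hypothetical choices made only for illustration, and expectations are replaced by sample means.

```python
import numpy as np

def cm_infomax_cost(W, X):
    """Sample estimate of phi(W) = sum_i E{(y_i^2 - 1)^2} - ln|det W|, with y = W^T x."""
    Y = W.T @ X
    cm_term = np.sum(np.mean((Y**2 - 1.0) ** 2, axis=1))
    return cm_term - np.log(abs(np.linalg.det(W)))

rng = np.random.default_rng(0)
# Two unit-variance uniform sources (negative kurtosis); illustrative data only.
S = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(2, 5000))
A = np.array([[1.0, 0.5], [0.3, 1.0]])          # hypothetical mixing matrix
X = A @ S

W_sep = np.linalg.inv(A).T                      # y = W^T x = s: one source per output
W_dup = W_sep.copy()
W_dup[:, 1] = W_sep[:, 0] + 1e-6 * W_sep[:, 1]  # both outputs nearly extract source 1

cost_sep = cm_infomax_cost(W_sep, X)
cost_dup = cm_infomax_cost(W_dup, X)
```

In this sketch the Constant Modulus part of the cost is almost the same for both matrices, since every output carries a single source. The difference comes from the log-determinant term, which penalizes the nearly proportional columns of `W_dup`.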

The ability of our approach to perform source separation is further
supported by an analysis of the stationary points of $\phi(\mathbf{W})$ presented in [12]. A
simple situation of a two-source mixture and a two-input, two-output neural
network was assumed. The analysis consisted of finding the points where the
gradient vanishes and examining the positive definiteness of the Hessian
matrix at these points to determine whether they are maxima, minima or saddle
points. It was possible to prove that the points W where source separation is
achieved correspond to minima if the kurtosis of the sources is negative. The
analysis became rather involved when trying to show that $\phi(\mathbf{W})$ does not
contain undesirable stationary points. Nevertheless, computer simulations
never revealed such undesirable equilibrium points.


3  Learning  Rules 

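In this section we discuss different learning rules to compute the coefficients
of the separating matrix that minimize $\phi(\mathbf{W})$. The first possibility is to use
a gradient descent algorithm of the form

$$\mathbf{W}(n+1) = \mathbf{W}(n) - \mu \nabla_{\mathbf{W}} \phi \qquad (11)$$

where $\mu$ is the algorithm step-size. Taking (2) into account and that
$\det \mathbf{W} = \sum_{j=1}^{N} w_{ij}\, \mathrm{cof}\, w_{ij}$ for any row i ($\mathrm{cof}\, w_{ij}$ being the cofactor of $w_{ij}$) we have

$$\frac{\partial \phi}{\partial w_{ij}} = 4E\{x_i y_j^3\} - 4E\{x_i y_j\} - \frac{\mathrm{cof}\, w_{ij}}{\det \mathbf{W}} \qquad (12)$$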

Dropping the expectation operator, the resulting stochastic gradient descent
algorithm reads

$$\mathbf{W}(n+1) = \mathbf{W}(n) + \mu \left( 4\,\mathbf{x}\mathbf{y}^T - 4\,\mathbf{x}\mathbf{y}^T\mathbf{D}(\mathbf{y}) + [\mathbf{W}^T(n)]^{-1} \right) \qquad (13)$$

where $\mathbf{D}(\mathbf{y}) = \mathrm{Diag}[y_1^2, y_2^2, \cdots, y_N^2]$. As discussed in [11], the term $+\mathbf{x}\mathbf{y}^T$ in
(13) is a typical anti-Hebbian term. The term $-\mathbf{x}\mathbf{y}^T\mathbf{D}(\mathbf{y})$ has the twofold
effect of stabilizing the algorithm and incorporating high order statistics to
reach output independence (it is well-known that the anti-Hebbian rule is
by itself unstable and reaches output decorrelation, not independence). It
constitutes an improved version of the adaptive Oja's algorithm [13] for
principal component analysis. Finally, the term $[\mathbf{W}^T]^{-1}$ precludes convergence
towards a singular matrix W.
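The gradient expression in (12) can be checked against finite differences of the sample-average cost. The sketch below assumes the cost $\sum_i E\{(y_i^2-1)^2\} - \ln|\det \mathbf{W}|$ with $\mathbf{y} = \mathbf{W}^T\mathbf{x}$ and replaces expectations by sample means; the data, the test matrix and the function names are hypothetical.

```python
import numpy as np

def cost(W, X):
    """Sample-average version of the cost function (10)."""
    Y = W.T @ X
    return np.sum(np.mean((Y**2 - 1.0) ** 2, axis=1)) - np.log(abs(np.linalg.det(W)))

def cost_gradient(W, X):
    """Matrix of partial derivatives d(phi)/d(w_ij) from (12), using sample means."""
    T = X.shape[1]
    Y = W.T @ X
    # The cofactor ratio cof(w_ij)/det(W) is entry (i, j) of the inverse transpose.
    return 4.0 * X @ (Y**3).T / T - 4.0 * X @ Y.T / T - np.linalg.inv(W).T

def numerical_gradient(W, X, h=1e-6):
    """Central finite differences, one weight at a time."""
    G = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            E = np.zeros_like(W)
            E[i, j] = h
            G[i, j] = (cost(W + E, X) - cost(W - E, X)) / (2.0 * h)
    return G

rng = np.random.default_rng(1)
X = rng.uniform(-1.5, 1.5, size=(3, 2000))          # illustrative observations
W = np.eye(3) + 0.2 * rng.standard_normal((3, 3))   # arbitrary test point
grad_analytic = cost_gradient(W, X)
grad_numeric = numerical_gradient(W, X)
```

The stochastic rule (13) corresponds to moving against `cost_gradient` with the sample means replaced by a single observation.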

In order to avoid computation of an inverse matrix at each update, a
second choice as a learning rule is a relative gradient descent algorithm [5] of
the form

$$\mathbf{W}(n+1) = \mathbf{W}(n) - \mu\, \mathbf{W}(n)\mathbf{W}^T(n) \nabla_{\mathbf{W}} \phi \qquad (14)$$

As long as W(n) remains a nonsingular matrix, it is easily shown that this
form of adaptation always reduces $\phi(\mathbf{W})$ [5]. From (13) it is apparent that
the resulting stochastic relative gradient algorithm is

$$\mathbf{W}(n+1) = \mathbf{W}(n) + \mu\, \mathbf{W}(n) \left( 4\,\mathbf{y}\mathbf{y}^T - 4\,\mathbf{y}\mathbf{y}^T\mathbf{D}(\mathbf{y}) + \mathbf{I} \right) \qquad (15)$$

where I is the identity matrix. Again, each term in (15) has a clear
interpretation from the perspective of source separation. The term $4\,\mathbf{y}\mathbf{y}^T$ forces
decorrelation between the different outputs whereas the second term $-4\,\mathbf{y}\mathbf{y}^T\mathbf{D}(\mathbf{y})$
involves higher order statistics and forces the stronger condition of
independence between different outputs. Finally, the identity matrix I prevents the
algorithm from converging towards a solution where all the outputs are equal to
zero.

Although motivated by the necessity of avoiding matrix inverse computations,
the learning rule (15) exhibits the more interesting property of being
an equivariant algorithm [5]. Premultiplying (15) by $\mathbf{A}^T$, it is obtained that
the combined mixing/separating system evolves under the following updating
rule

$$\mathbf{G}^T(n+1) = \mathbf{G}^T(n) + \mu\, \mathbf{G}^T(n) \left( 4\,\mathbf{y}\mathbf{y}^T - 4\,\mathbf{y}\mathbf{y}^T\mathbf{D}(\mathbf{y}) + \mathbf{I} \right) \qquad (16)$$

that does not depend explicitly on the mixing matrix A. As a consequence,
(15) possesses the equivariance property because the time evolution of the
global system is independent of A: it only depends on the initial conditions
and the statistical characteristics of the sources s. It will perform adequately
even when A is an ill-conditioned matrix.
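The equivariance property can be illustrated numerically: two runs of rule (15) with different mixing matrices, but with the same sources and the same initial global matrix, should produce the same global matrix. The following is a minimal sketch; the mixing matrices, step-size, number of samples and function names are assumptions chosen for illustration only.

```python
import numpy as np

def relative_gradient_step(W, x, mu):
    """One stochastic update of rule (15): W <- W + mu W (4 y y^T - 4 y y^T D(y) + I)."""
    y = W.T @ x
    yyT = np.outer(y, y)
    # Right-multiplying by D(y) = Diag[y_1^2, ..., y_N^2] scales column j by y_j^2.
    update = 4.0 * yyT - 4.0 * yyT * (y**2)[None, :] + np.eye(len(y))
    return W + mu * W @ update

def run(A, G0T, S, mu=0.01):
    """Run rule (15) on x = A s, starting from W(0) such that A^T W(0) = G0T."""
    W = np.linalg.solve(A.T, G0T)
    for n in range(S.shape[1]):
        W = relative_gradient_step(W, A @ S[:, n], mu)
    return A.T @ W   # transpose of the global matrix G = W^T A

rng = np.random.default_rng(2)
S = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(2, 200))  # unit-variance uniform sources
G0T = 0.5 * np.eye(2)
A1 = np.array([[1.0, 0.5], [0.3, 1.0]])    # hypothetical mixing matrices
A2 = np.array([[2.0, -1.0], [1.0, 1.0]])
GT_1 = run(A1, G0T, S)
GT_2 = run(A2, G0T, S)
```

Up to floating-point rounding, `GT_1` and `GT_2` coincide, which is the content of (16): the trajectory of the global system depends on the sources and on the initial global matrix, not on A itself.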

4  Computer  Simulations 

In this section we present the results of some computer experiments carried
out to illustrate the performance of the equivariant learning rule (15) and
establish comparisons with existing approaches. As sources, we consider three
images with 256 x 256 pixels having a normalized kurtosis of -1.42, -0.75
and -0.51. These images can be seen in the left column of figure 3.

In the first simulation experiment we considered the following mixture
matrix

$$\mathbf{A} = \begin{bmatrix} -1 & -1 & 1 \\ 1 & 1 & 1 \\ 1 & -1 & 1 \end{bmatrix} \qquad (17)$$

The resulting mixed observations are plotted in the middle column of figure
3. To recover the sources from the observations, a 3 x 3 MIMO system
is considered whose coefficients are updated according to the equivariant
algorithm (15). To reduce the misadjustment noise, we chose a variable step-size
that starts from fx = b x 10“^ and is multiplied by 0.6 every 10,000
iterations. The right column of figure 3 shows the outputs corresponding to
the separating system obtained after 65,536 iterations, which is the size of a
256 x 256 image. It is clearly seen that our approach was able to successfully
recover the original sources.

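In order to measure the performance of our algorithm and make comparisons
with existing approaches we define the following index [11], which is zero
iff $\mathbf{G} = \mathbf{W}^T\mathbf{A}$ corresponds to source separation

$$I(\mathbf{W}) = \sum_{i=1}^{N} \left( \sum_{j=1}^{N} \frac{|g_{ij}|^2}{\max_k |g_{ik}|^2} - 1 \right) + \sum_{j=1}^{N} \left( \sum_{i=1}^{N} \frac{|g_{ij}|^2}{\max_k |g_{kj}|^2} - 1 \right) \qquad (18)$$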

Figure 4 plots the time evolution of this performance index for the proposed
algorithm (15), the EASI algorithm with a cubic nonlinearity [5], Bell and
Sejnowski [7] (BS) and Cichocki and Unbehauen [4] (CU). The latter three were
implemented with a variable step-size strategy similar to the one described
above. It is apparent that our approach performs almost the same as the
EASI algorithm whereas it outperforms BS and CU.
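A performance index of this kind is straightforward to compute from the global matrix G. The sketch below assumes the commonly used row-and-column form with squared magnitudes normalized by the largest entry of each row and column; this form is an assumption, since the exact expression of (18) should be taken from [11]. The example matrices are hypothetical.

```python
import numpy as np

def performance_index(G):
    """Separation index: zero iff G is a scaled permutation matrix.

    Assumed form: for every row and every column, sum the squared entries
    normalized by the largest squared entry, then subtract one.
    """
    P = np.abs(G) ** 2
    rows = np.sum(P / P.max(axis=1, keepdims=True), axis=1) - 1.0
    cols = np.sum(P / P.max(axis=0, keepdims=True), axis=0) - 1.0
    return rows.sum() + cols.sum()

# A scaled permutation (perfect separation) and a residual mixture.
G_separated = np.array([[0.0, -2.0, 0.0],
                        [0.5,  0.0, 0.0],
                        [0.0,  0.0, 3.0]])
G_mixed = np.array([[1.0, 0.4, 0.1],
                    [0.2, 1.0, 0.3],
                    [0.1, 0.5, 1.0]])
index_separated = performance_index(G_separated)
index_mixed = performance_index(G_mixed)
```

The index is invariant to the gains and ordering of the recovered sources, which matches the ambiguity inherent in blind separation.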

Finally, to test the equivariance property of the learning rule (15), we
carried out a second simulation experiment considering the ill-conditioned
mixture matrix

$$\mathbf{A} = \begin{bmatrix} 1.01 & 1 & 1 \\ 1 & 1.01 & 1 \\ 1 & 1 & 1.01 \end{bmatrix} \qquad (19)$$

Figure 5 plots the performance index for the same algorithms as before, showing
again that the EASI algorithm and ours outperform
BS and CU. In addition, the speed of convergence has remained
unchanged.


5  Conclusions 

Existing information transfer maximization algorithms [7] that use conventional
activation functions are only capable of separating sources with positive
kurtosis and therefore cannot be used in communications or image processing
applications where signals typically have negative kurtosis. This paper
overcomes this limitation by presenting a single layer neural network with a new
activation function. It is shown that maximizing its information transfer is
equivalent to minimizing a statistical criterion that involves the well-known
Constant Modulus criterion [10] originally used for blind equalization. Two
learning rules have been derived, a conventional gradient descent rule and a
relative or natural gradient descent rule. The latter turns out to be an
equivariant algorithm whose performance is independent of the mixture matrix.
Finally, simulations show that our approach performs the same as or better than
existing blind source separation adaptive algorithms when applied to sources
with negative kurtosis.

Figure 3: Blind separation of three images with the proposed algorithm.

Figure 4: Performance index for the first computer experiment with the
well-conditioned mixture matrix (17). (Panel title: Three sources.)

Figure 5: Performance index for the second computer experiment with the
ill-conditioned mixture matrix (19). (Panel title: Three sources.)


Abstract

Blind source separation and blind output decorrelation are two well-known
problems in signal processing. For instantaneous mixtures, blind source separation is
equivalent to a generalized eigen-decomposition, while blind output decorrelation
can be considered as an iterative method of output orthogonalization. We propose a
steepest descent procedure on a new cost function based on the Frobenius norm
which measures the diagonalization of correlation matrices to perform blind source
separation as well as blind decorrelation. The method is applicable to both stationary
and nonstationary signals and instantaneous as well as convolutive mixture
models. Simulation results by Monte Carlo trials are provided to show the consistent
performance of the proposed algorithm.


1.  Introduction 

The field of blind signal processing, which includes blind source separation, blind
decorrelation and blind equalization, has recently received a lot of attention. Most of
the blind separation algorithms are based on high-order statistical information
because it can be shown that second order statistics are not sufficient to uniquely
separate sources [15]. However, this proof requires signal stationarity, which does not
hold in many important real life problems (such as separation of speech).
There have also been reports showing experimentally that systems based on second
order statistics can indeed separate mixed sources [1]-[9]. These methods exploit
the time characteristics of the covariance, i.e. either nonstationary temporal
estimates of covariance matrices or time-delayed cross-correlation matrices. Blind
decorrelation can be formulated with second order statistics and can be solved by
the orthogonalization of covariance matrices [11, 12]. Therefore methods based on
second order statistics for blind source separation are related to blind decorrelation,
but there is no systematic coverage of the two areas. In this paper, we will unify the
framework of blind source separation using second order statistics with blind
decorrelation and apply the new algorithm to speech data.


The signal separation and decorrelation problems can be modeled as

$$X(t) = [x_1(t), \cdots, x_M(t)]^T = \Big[ \sum_{j=1}^{N} a_{1j} s_j(t), \cdots, \sum_{j=1}^{N} a_{Mj} s_j(t) \Big]^T = A S(t) \qquad (1)$$

where $S(t)$ is an N by 1 zero-mean vector containing the original signals, $X(t)$ is an M
by 1 coupled signal vector ($M \ge N$), A is an M by N real time-invariant coupling
matrix, $s_j(t)$ is the unknown signal and $a_{ij}$ is a real unknown coupling
coefficient. If A is full rank, we can always use PCA to find the subspace of $X(t)$ which is
equivalent to the space of $S(t)$. The objective of blind decorrelation is to design a
full-rank weight matrix W that constructs an output $Y(t) = W^T X(t)$ displaying a
diagonal output covariance $E\{Y(t)Y^T(t)\}$. However, blind separation has a different
objective since one needs to design a signal separator which is able to reconstruct
the original signals. The latter constraint puts a stronger requirement on the
weight matrix W such that

$$W^T = A^{-1} \quad \text{or} \quad W^T A = PD \qquad (2)$$

where P is a permutation matrix, D is a diagonal scaling matrix and T denotes
matrix transpose [13]. The weight matrices for blind separation belong to a subset
of the blind decorrelation solution.


2.  Approach  and  criterion 

According to [4, 5], blind separation of nonstationary signals can be formulated as
the simultaneous diagonalization of two covariance matrices estimated at different
times, which can be further reduced to a generalized eigen-decomposition problem
(requiring off-line processing). Orthogonalization of covariance estimates at many
time instants with regularization was suggested in [6] as an on-line algorithm for
blind separation. However, the method can only decorrelate signals with
positive-definite covariance matrices due to a restriction placed on the cost function. Instead
of using different covariance estimates at different time intervals, one can still
perform blind separation (decorrelation) through the simultaneous orthogonalization
of two or more time-delayed correlation matrices [7, 8] as

$$W^T E\{X(t)X^T(t-q)\} W = D(q) \qquad (3)$$

where $D(q)$ is a diagonal matrix associated with a delay q. The non-symmetrical
time delayed correlation matrix $E\{X(t)X^T(t-q)\}$ is not necessarily positive-definite
and hence we cannot apply the criterion proposed by [6]. To our knowledge only
the generalized eigen-decomposition was proposed to solve this formulation but the
method is an analytic solution [7, 8]. Here we will propose an alternative criterion
to solve the realistic case of non-positive-definite time-delayed correlation matrices
iteratively.
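For concreteness, the time-delayed correlation matrices in Eq. (3) can be estimated from a finite record by a sample average. The following is a minimal sketch; the coloured test sources, the mixing matrix and the function name `delayed_correlation` are hypothetical and serve only to produce data with temporal structure.

```python
import numpy as np

def delayed_correlation(X, q):
    """Sample estimate of E{X(t) X^T(t - q)} for an (M, T) data array X."""
    T = X.shape[1]
    # Columns q..T-1 hold X(t); columns 0..T-q-1 hold X(t - q).
    return X[:, q:] @ X[:, :T - q].T / (T - q)

rng = np.random.default_rng(3)
# Two coloured sources: moving averages of white noise with different spans.
e = rng.standard_normal((2, 4000))
s1 = np.convolve(e[0], np.ones(5) / 5.0, mode="same")
s2 = np.convolve(e[1], np.ones(20) / 20.0, mode="same")
S = np.vstack([s1, s2])
A = np.array([[1.0, 0.6], [0.4, 1.0]])   # hypothetical coupling matrix
X = A @ S

C0 = delayed_correlation(X, 0)   # ordinary correlation matrix, symmetric
C1 = delayed_correlation(X, 1)   # one-sample delay, in general not symmetric
```

Only the zero-delay estimate is guaranteed to be symmetric and positive semi-definite, which is why a criterion restricted to positive-definite matrices does not carry over to the delayed case.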


The idea is to create a cost function which minimizes directly the difference
between the quadratic form in the left side of Eq. (3) and its right side. In order to
measure the distance between the correlation estimate $E\{Y(t)Y^T(t-q)\}$ and its
diagonal version, $D_q(t)$, we propose the following criterion:

$$\sum_{q=0}^{d} J_q(t) = \sum_{q=0}^{d} \left\| E\{Y(t)Y^T(t-q)\} - D_q(t) \right\|_F^2 \qquad (4)$$

as well as

$$J_q(t) = \left\| W^T E\{X(t)X^T(t-q)\} W - D_q(t) \right\|_F^2 \qquad (5)$$

where $\|\cdot\|_F$ denotes the Frobenius norm and d is the total number of the delayed
covariance matrices $D_q(t)$ we need to constrain. However the choice of d still needs
to be further analyzed. The Frobenius norm is defined as [14]:

$$\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2} \qquad (6)$$


where A is an m-by-n matrix. The minimum of this cost function preserves the
diagonal elements of $E\{Y(t)Y^T(t-q)\}$ but zeros all other elements out, i.e., it solves both
the blind separation and decorrelation problems. The cost function of Equation (4)
is a nonnegative fourth-order function of the weight coefficients W. The minimum
can be obtained using a gradient descent procedure
$\Delta W = -\eta \sum_{q=0}^{d} \partial J_q(t) / \partial W$. Simple
linear algebra manipulations yield

$$\Delta W = -\eta \sum_{q=0}^{d} \left\{ [C_q(t) + C_q^T(t)]\, W\, [W^T C_q(t) W - D_q(t)] \right\} \qquad (7)$$

where

$$C_q(t) = E\{X(t)X^T(t-q)\} \qquad (8)$$

In addition, the gradient descent procedure using our proposed criterion will move
the directions of the output signals in such a way that it orthogonalizes all the vectors
in tandem, as depicted in Figure 1. This methodology is different from Gram-Schmidt,
where the principal component must stabilize before the others converge
since they are corrected with respect to it (yielding what is called the deflation
procedure). Deflation procedures converge sequentially, which is a known problem for
the recent prevalent orthogonalization procedures based on the L2 norm or Rayleigh
quotient optimization.


Our  procedure  does  not  constrain  the  size  of  the  vectors,  so  null  vectors  can  occur 
during  learning.  This  corresponds  to  a  trivial  solution  that  meets  the  minimization 
of  the  criterion  of  Eq.  (4).  To  avoid  it,  we  have  to  impose  a  unit  length  constraint  on 
the  weight  vectors. 


3.  Implications  of  the  new  criterion 

For the case $D_q(t) = I$ and q = 0, Eq. (7) yields,

W(t  +  1)  =  (9) 

We  will  show  that  this  adaptation  rule  was  previously  utilized  in  blind  decorrelation 
and  independent  component  analysis. 

3.1  Stochastic  Whitening  Procedure 

If the requirement is to obtain whitened outputs, we have to estimate $E\{Y(t)Y^T(t)\}$
either using an exponential window or batch mode. In [12] the following on-line
stochastic whitening procedure was derived through a Gram-Schmidt-like
orthogonalization,

$$W(t+1) = W(t) - \eta\, W(t) [Y(t)Y^T(t) - I] \qquad (10)$$

which is equivalent to Eq. (9) if the first $E\{Y(t)Y^T(t)\}$ is dropped.
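The whitening rule (10) can be illustrated in batch form by replacing the instantaneous product $Y(t)Y^T(t)$ with its expectation $W^T C W$, where C is the input correlation matrix. This averaged form, the step-size, the initialization and the test data are assumptions made for this sketch, not the on-line procedure of [12].

```python
import numpy as np

def whitening_iteration(C, eta=0.1, n_iter=1000):
    """Batch form of rule (10): W <- W - eta W (W^T C W - I)."""
    n = C.shape[0]
    # A small initial W keeps the eigenvalues of W^T C W at or below one.
    W = np.eye(n) / np.sqrt(np.trace(C))
    for _ in range(n_iter):
        W = W - eta * W @ (W.T @ C @ W - np.eye(n))
    return W

rng = np.random.default_rng(5)
A = np.array([[1.0, 0.6], [0.4, 1.0]])          # hypothetical coupling matrix
X = A @ rng.standard_normal((2, 5000))
C = X @ X.T / X.shape[1]                        # sample correlation matrix
W = whitening_iteration(C)
output_correlation = W.T @ C @ W
```

With this initialization the eigenvalues of $W^T C W$ increase monotonically towards one, so the output correlation converges to the identity, i.e. the outputs $Y = W^T X$ are whitened.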

3.2 Kullback-Leibler Divergence

If we replace $Y(t)$ by $f(Y(t))$ in Eq. (10), where $f(\cdot)$ is a proper nonlinear function,
we obtain the update rule derived from minimization of the Kullback-Leibler
divergence in [10].

3.3 Generalized Blind Decorrelation Rule

The adaptation rule for generalized decorrelation at iteration n using Eq. (7) with an
arbitrary square matrix $C_q(t)$ is

$$W_{n+1} = W_n + \Delta W_n \quad \text{subject to} \quad w_n^T(i)\, w_n(i) = 1, \ \text{for all } i = 1, 2, \ldots, N, \qquad (11)$$

where $w_n(i)$ is the i-th column vector of $W_n$. With Eq. (11), we can extend the
solution presented in [11] to any time-delayed correlation matrix. If we orthogonalize
several time-delayed matrices using this procedure, we can obtain separated signals
as in [7].

3.4  On-line  Adaptation  Rule  for  Blind  Separation  of  Nonstationary  Signals 
(Instantaneous  mixture) 

It has been proved in [6] that, for a linear time-invariant instantaneous mixture of
locally stationary signals, the source signals can uniquely be determined from the
sensed signals (except for the arbitrariness of the permutation matrix P and the
diagonal matrix D) if and only if Eq. (12) holds

$$E\{y_i(t)\, y_j(t)\} = 0 \quad \text{for all } i \ne j \text{ at any instant of time } t. \qquad (12)$$

Eq. (12) can be easily translated as the orthogonalization of the output correlation
matrix at any instant of time t as in Eq. (13)

$$W^T E\{X(t)X^T(t)\} W = D_0(t) \quad \text{at any instant of time } t. \qquad (13)$$

Here we propose to use batch learning with non-overlapping windowed estimates. The
blind source separation rule for nonstationary signals becomes,
for the m-th batch,

$$W_{m+1} = W_m + \Delta W_m \quad \text{subject to} \quad w_m^T(i)\, w_m(i) = 1, \ \text{for all } i = 1, 2, \ldots, N,$$

where $w_m(i)$ is the i-th column vector of $W_m$.
$C_q(m)$ has to be used in Eq. (7) to calculate $\Delta W_m$.

4.  Blind  source  separation  of  linear  convolutive  mixture 

If  we  have  the  sensed  signal  X{t)  =  [Xi(t)  ^2(1), ...,  composed  by  a  linear 




convolutive  mixture  of  sources,  X{z)  =  H{z)S{z),  where 


^11(2)  ••• 

5l(z) 

H{z)  = 

h.fz)  ... 
A^j(z)  ... 

S{z)  = 

The  solution  7(z)  for  separated  signals  is  Y{z)  =  PK{z)ir'^{z)  where  is  a  permuta¬ 
tion  matrix  and  K{z)  is  a  diagonal  matrix  with  some  arbitrary  shaping  filters  as  its 
diagonal  elements.  We  can  modify  Eq.  (8)  as 

Cq(t)  =  (14) 


where  X(t)  is 

In  addition,  we  also  need  to  modify  the  separation  weight  matrix  W  as 


= 


^1  *^21 

^12  ^22  ••• 

.  5^.. 

... 

ij 

: 

••• 

<iJ<N 

(15) 


(16) 


of  length  L.  With  the  new  definitions  of  Eq.  (14),  (15)  and  (16),  we  can  apply  Eq. 
(7)  to  solve  for  linear  convolutive  mixture. 


5.  Simulation  results 

Since the solution for blind source separation is a subset of blindly decorrelated
outputs, we focus our first experiment on blind source separation of instantaneously
mixed signals. We select speech signals spoken by two male speakers (TIMIT database),
and we artificially mix them with a matrix A. We acknowledge that this is a
simplified problem, but it is one where we have control of the experiments to test
the new algorithm. The method can be easily extended to any N-by-N case.

We choose the mixing matrix A randomly and average the performance over sixty
Monte Carlo trials with sixty different random initial weights. We then run the
experiments for 200 epochs according to our proposed adaptation rule. With this
experiment we seek to find how reliable the method is, i.e. how many times the
solution is found and what the variance in the estimates is. Figures 2 and 3 plot the
mean Frobenius distances given by Eq. (4) versus epoch number with the corresponding
one-standard-deviation error bars (upper / lower limits of performance
curves). In Figure 2 the criterion is $J_q$ and all the Monte Carlo runs converged to the
true solution in less than 20 epochs. On the other hand, Figure 3 shows the fast
convergence of our algorithm when we try to minimize the combined criterion
$\sum_{q=0}^{3} J_q$. Among the trials we select one to illustrate the convergence results. The
particular instantaneous mixture matrix was


$$A = \begin{bmatrix} 0.7012 & 0.7622 \\ 0.9103 & 0.2625 \end{bmatrix}. \tag{17}$$


After only 4 to 12 learning epochs, the system reaches the optimum with the
separation weight matrix $W_{opt}$. We can investigate whether the product of the two
matrices $W_{opt}^T A$ is actually in the form of PD as previously described [13]. To do so, we
normalize each row of the product matrix $W_{opt}^T A$ by the absolute value of its dominant
element and obtain the product as

$$W_{opt}^T A = \begin{bmatrix} -0.0025 & -1.0000 \\ 1.0000 & 0.0055 \end{bmatrix}. \tag{18}$$


From Eq. (18) we can see that we have removed almost all the interference
from the other source in this trial picked at random: the SNR (signal-to-noise
ratio) is 52.04 dB at receiver 1 and 45.19 dB at receiver 2, respectively. The
distortion is almost unnoticeable.
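
The quoted SNR values can be checked directly from the row-normalized product matrix. The following is a minimal sketch for illustration only, not the authors' code. It assumes that the SNR at each receiver is the amplitude ratio, in dB, between the dominant entry of a row and the remaining entry; the helper name `row_snr_db` is hypothetical.

```python
import math

# Row-normalized product matrix as quoted in Eq. (18).
product = [
    [-0.0025, -1.0000],
    [1.0000, 0.0055],
]

def row_snr_db(row):
    """Amplitude ratio, in dB, of the dominant entry to the sum of the other entries."""
    magnitudes = sorted((abs(v) for v in row), reverse=True)
    signal, interference = magnitudes[0], sum(magnitudes[1:])
    return 20.0 * math.log10(signal / interference)

snr_db = [row_snr_db(row) for row in product]
print([round(v, 2) for v in snr_db])  # [52.04, 45.19]
```

Under this assumption the two rows give 52.04 dB and 45.19 dB, which agrees with the figures stated for receivers 1 and 2.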


In  our  second  experiment,  for  a  convolutive  mixture  of  two  sources,  we  choose  an 
arbitrary  mixing  matrix 


H{z)  = 


1 

0.7z“’  +  0.42“^  +  0.25/“^ 


0.85z  ^  +  0.1z  ^ 


(19) 


The algorithm needs approximately 150 epochs to converge as shown by the learning
curve plotted in Figure 4 (L = 6 in Eq. (15), d = 20 in Eq. (4), and the energies of
$s_1(t)$ and $s_2(t)$ are equal for this case). Since the convolutive model is more
complicated, we cannot simply apply the product in Eq. (18) to investigate the simulation
result. However, we can compute the output as


$$Y(z) = W^T(z)H(z)S(z) = \begin{bmatrix} w_{11}(z) & \cdots & w_{1N}(z) \\ \vdots & w_{ij}(z) & \vdots \\ w_{N1}(z) & \cdots & w_{NN}(z) \end{bmatrix}^T H(z)S(z), \tag{20}$$

where $w_{ij}(z)$ is the z-transform of $w_{ij}$ for $1 \le i, j \le N$, and H(z), S(z) were
previously defined. Hence from Eq. (20) the outputs for the simulation of the
convolutive model can be obtained as

$$y_i(z) = a_{i1}(z)S_1(z) + a_{i2}(z)S_2(z) \quad \text{for } i = 1, 2, \tag{21}$$

or $y_i(t) = y_{i1}(t) + y_{i2}(t)$, where $a_{ij}(z)$ is the (i, j) entry of $W^T(z)H(z)$ and
$y_{ij}(t)$ is the inverse z-transform of $a_{ij}(z)S_j(z)$.
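According to Eq. (21) the energy matrix can be defined as [8]

$$\begin{bmatrix} \sum_t y_{11}^2(t) & \sum_t y_{12}^2(t) \\ \sum_t y_{21}^2(t) & \sum_t y_{22}^2(t) \end{bmatrix}$$

and it is

$$\begin{bmatrix} 0.0776 & 1.1534 \\ 1.0787 & 0.0946 \end{bmatrix}$$

in our simulation.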


The SNRs are 14.86 dB at receiver 1 and 11.40 dB at receiver 2, respectively, which
is reasonable for the size of the filters employed. The two original signals, the two
convolutively mixed signals and the recovered signals are all depicted in Figure 5.
In listening tests, each output channel is dominated by a single recovered signal.

Figure 5. Signals for the simulation of linear convolutive mixture (original, coupled, and recovered signals).
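
The SNR figures of the convolutive experiment can be related to the reported energy matrix. The sketch below is illustrative only and is not the authors' code; the helper name `dominance` is hypothetical. It divides the dominant entry of each row by the interfering entry. The plain ratios come out as 14.86 and 11.40, matching the quoted figures, whereas the same ratios expressed as 10 log10 are about 11.7 dB and 10.6 dB, so the quoted values may follow a definition that the text does not spell out.

```python
import math

# Energy matrix reported for the convolutive mixture experiment.
energy = [
    [0.0776, 1.1534],
    [1.0787, 0.0946],
]

def dominance(row):
    """Return (ratio, ratio in dB) of the dominant energy to the sum of the others."""
    ordered = sorted(row, reverse=True)
    ratio = ordered[0] / sum(ordered[1:])
    return ratio, 10.0 * math.log10(ratio)

results = [dominance(row) for row in energy]
for receiver, (ratio, ratio_db) in enumerate(results, start=1):
    print(f"receiver {receiver}: ratio {ratio:.2f}, {ratio_db:.2f} dB")
```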

6.  Conclusion 

We propose a generalized criterion based on the Frobenius norm for both blind
source separation and decorrelation, lifting a previous restriction on the positive
definiteness of covariance matrices. Since the method is based on the minimization of
a cost function, it leads to on-line adaptation algorithms. However, the algorithm is
not  local  and  the  computational  complexity  is  0(A^^),  where  N  is  the  size  of  the  net¬ 
work.  The  method  displays  fast  and  robust  convergence  as  shown  by  Monte  Carlo 
runs. The method is applicable to nonstationary signals like speech. If the signals
are assumed stationary, the time-delayed decorrelation algorithm can also be used
as an iterative alternative for blind separation, similar to [7]. The cost function was
extended to convolutive mixtures.


Neural Networks for Intelligent Multimedia Processing

S.Y.Kung 

Princeton  University 

Multimedia technologies represent a new ground for research interactions among a
variety of media such as speech, audio, image, video, text, and graphics. Future
multimedia technologies will need to handle information from multiple signal
sources with an increasing level of intelligence, i.e. automatic recognition capabilities.
On the other hand, neural processing is well known to be an attractive means
for implementing robust pattern recognition. It offers a good means for image and
video segmentation and content-based indexing and retrieval. Unsupervised clustering
and training by example are popular neural learning mechanisms. By these,
machines may be taught to interpret possible variations of an object: e.g. scale,
orientation/rotation, translation, contrast, and perspective. Ultimately, machines can
be “trained” to see or hear, to recognize objects or faces, and to perceive human
gestures or even emotions. In addition to adaptive learning, other useful characteristics
include layered or hierarchical neural models and spatial/temporal processing
models (as in temporal and static neural network structures). Some neural models
have also effectively incorporated statistical signal processing (expectation-maximization,
Gaussian mixtures) and optimization techniques (annealing). These key
features of neural information processing have proven to be instrumental and effective
in many applications in intelligent multimedia processing (IMP). It is therefore
envisioned that a major impact may be achieved by integrating adaptive neural
processing into state-of-the-art multimedia technologies. Neural processing and
IMP share the following characteristics:

Digital Media Carrying Voluminous Spatial-Temporal Data: Tons of audiovisual
information are available in digital form, accessible via the Internet to/from all
places around the world.

Multi-Modality (Multiple Sensor/Data Sources): Joint processing of multimedia
could result in significant advantages.

Trend  Towards  Intelligent  Information  Processing:  So  far,  only  text-based 
search  engines  are  available  on  the  WWW.  (In  fact,  they  are  among  the  most 
frequently  visited  sites.)  Some  multimedia  databases  can  offer  very  limited 
searching  capabilities  for  pictures  via  color/texture  features  or  information 
about  the  shape  of  objects  in  the  picture.  Advanced  and  more  reliable  indexing 
and retrieval techniques are not yet available on the market.

Many “laboratory” successes of neural networks for IMP applications have been
reported, quite encouragingly. Examples include speech recognition/understanding;
character recognition; texture classification; image/video segmentation; face-object
detection/recognition; tracking of 3D objects; and lip reading via multi-modality
combining visual and acoustic processing.

The potential applications of neural networks for IMP spread over a much broader
spectrum: education (remote learning); shopping (searching clothes or fashion
designs or a 3D house model); digital libraries (image catalog); journalism and
multimedia editing (personalized electronic news service; media authoring; searching
video clips of a celebrity); multimedia directory services (yellow pages);
entertainment and medical applications; etc. At present, however, it is not possible to
efficiently search the web for, say, a picture or video clip that matches a sample but
was shot from a different angle. In fact, there exists no generally recognized
description of audiovisual contents. The research frontier today is gradually moving
from what was primarily focused on coding (MPEG-2 and MPEG-4) to a new focus on
automatic recognition. This trend is precipitated by a new member of the MPEG
family: MPEG-7. MPEG-7 focuses on a “multimedia content description interface”.
Its goal is to extend the current search capabilities to include more information types.
Specifically, MPEG-7 will specify a standardized description of various types of
multimedia information, including: still pictures, graphics, audio, moving video, and
information about how these elements are combined in a multimedia presentation
(‘scenarios’, composition information). This description shall be associated with the
content itself, to facilitate fast and efficient searching for all the aforementioned media.
The MPEG-7 research domain will cover techniques for content-based indexing and
retrieval: pattern recognition, face detection/recognition, fusion of multi-modality.
For these, neural networks offer a very promising core technology.


Abstract

A  novel  method  for  regression  has  been  recently  proposed  by 
V.  Vapnik  et  al.  [8,  9].  The  technique,  called  Support  Vector 
Machine  (SVM),  is  very  well  founded  from  the  mathematical 
point  of  view  and  seems  to  provide  a  new  insight  in  function 
approximation.  We  implemented  the  SVM  and  tested  it  on 
the  same  data  base  of  chaotic  time  series  that  was  used  in 
[1]  to  compare  the  performances  of  different  approximation 
techniques, including polynomial and rational approximation,
local polynomial techniques, Radial Basis Functions, and Neural
Networks. The SVM performs better than the approaches
presented  in  [1].  We  also  study,  for  a  particular  time  series, 
the  variability  in  performance  with  respect  to  the  few  free 
parameters  of  SVM. 


1  Introduction 

In this paper we analyze the performance of a new regression technique called
a Support Vector Machine [8, 9]. This technique can be seen as a new way
to train polynomial, neural network, or Radial Basis Functions regressors.
The main difference between this technique and many conventional regression
techniques is that it uses the Structural Risk Minimization and not the
Empirical Risk Minimization induction principle. Since this is equivalent to
minimizing an upper bound on the generalization error, rather than minimizing
the training error, this technique is expected to perform better than
conventional techniques. Our results show that SVM is a very promising
regression technique, but in order to assess its reliability and performance
more extensive experimentation will need to be done in the future. We begin
by applying SVM to several chaotic time series data sets that were used by
Casdagli [1] to test and compare the performances of different approximation
techniques. The SVM is a technique with few free parameters. In the absence of
a principled way to choose these parameters we performed an experimental
study to examine the variability in performance as some of these parameters
vary between reasonable limits. The paper is organized as follows. In the
next section we formulate the problem of time series prediction and see how
it is equivalent to a regression problem. In section 3 we briefly review the
SVM approach to the regression problem. In section 4, the chaotic time series
used to benchmark previous regression methods [1] are introduced. Section
5 contains the experimental results and the comparison with the techniques
presented in [1]. Section 6 focuses on a particular series, the Mackey-Glass,
and examines the relation between parameters of the SVM and generalization
error.


2 Time Series Prediction and Dynamical Systems

For the purpose of this paper a dynamical system is a smooth map $F : R \times S \to S$,
where S is an open set of a Euclidean space. Writing $F(t, \mathbf{x}) = F_t(\mathbf{x})$,
the map F has to satisfy the following conditions:

1. $F_0(\mathbf{x}) = \mathbf{x}$;

2. $F_t(F_s(\mathbf{x})) = F_{s+t}(\mathbf{x}) \quad \forall s, t \in R$.

For any given initial condition $\mathbf{x}_0 = F_0(\mathbf{x})$ a dynamical system defines a
trajectory $\mathbf{x}(t) = F_t(\mathbf{x}_0)$ in the set S. The direct problem in dynamical systems
consists in analyzing the behavior and the properties of the trajectories $\mathbf{x}(t)$
for different initial conditions $\mathbf{x}_0$. We are interested in a problem similar
to the inverse of the problem stated above. We are given a finite portion
of a time series $x(t)$, where x is a component of a vector $\mathbf{x}$ that represents
a variable evolving according to some unknown dynamical system. We assume
that the trajectory $\mathbf{x}(t)$ lies on a manifold with fractal dimension D (a
“strange attractor”). Our goal is to be able to predict the future behavior
of the time series $x(t)$. Remarkably, this can be done, at least in principle,
without knowledge of the other components of the vector $\mathbf{x}(t)$. In fact, Takens
embedding theorem [7] ensures that, under certain conditions, for almost
all $\tau$ and for some $m \le 2D + 1$ there is a smooth map $f : R^m \to R$ such that:

$$x(n\tau) = f(x((n-1)\tau), x((n-2)\tau), \ldots, x((n-m)\tau)) \tag{1}$$

The value of m used is called the embedding dimension and the smallest
value for which (1) is true is called the minimum embedding dimension, m*.
Therefore, if the map f were known, the value of x at time $n\tau$ would be uniquely
determined by its m values in the past. For simplicity of notation we define
the m-dimensional vector

$$\mathbf{x}_{n-1} = (x((n-1)\tau), x((n-2)\tau), \ldots, x((n-m)\tau))$$

in such a way that eq. (1) can be written simply as $x(n\tau) = f(\mathbf{x}_{n-1})$.
If N observations of the time series x(t) are known, then one
also knows N − m values of the function f, and the problem of learning
the dynamical system becomes equivalent to the problem of estimating the
unknown function f from a set of N − m sparse data points in $R^m$. Many
regression techniques can be used to solve problems of this type. In this paper
we concentrate on the Support Vector algorithm, a novel regression technique
developed by V. Vapnik et al. [9].
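
The reduction from time series prediction to regression can be made concrete with a short sketch. It is illustrative only: the function name `delay_embed` and the toy series are assumptions, and the series is taken to be already sampled at the interval τ. Each input vector holds the m previous samples, most recent first, as in the definition of the m-dimensional vector above, and the target is the sample that follows them.

```python
def delay_embed(series, m):
    """Build (inputs, targets) pairs from a scalar series sampled every tau.

    inputs[k]  = (x((n-1)tau), ..., x((n-m)tau)), most recent sample first
    targets[k] = x(n tau)
    """
    inputs, targets = [], []
    for n in range(m, len(series)):
        inputs.append([series[n - k] for k in range(1, m + 1)])
        targets.append(series[n])
    return inputs, targets

series = [0.1 * k for k in range(10)]  # toy data, not one of the benchmark series
X, y = delay_embed(series, m=4)
print(len(X))  # N - m = 6 training pairs from N = 10 observations
```

Any regression method can then be fitted to the pairs `(X, y)`.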


3  Support  Vectors  Machines  for  Regression 

In this section we sketch the ideas behind the Support Vectors Machines
(SVM) for regression; a more detailed description can be found in [9] and [8].
In a regression problem we are given a data set $G = \{(\mathbf{x}_i, y_i)\}_{i=1}^N$, obtained
by sampling, with noise, some unknown function $g(\mathbf{x})$, and we are asked to
determine a function f that approximates $g(\mathbf{x})$, based on the knowledge of G.
The SVM considers approximating functions of the form:

$$f(\mathbf{x}, \mathbf{c}) = \sum_{i=1}^{D} c_i \phi_i(\mathbf{x}) + b \tag{2}$$

where the functions $\{\phi_i(\mathbf{x})\}_{i=1}^D$ are called features, and b and $\{c_i\}_{i=1}^D$ are
coefficients that have to be estimated from the data. This form of approximation
can be considered as a hyperplane in the D-dimensional feature space defined
by the functions $\phi_i(\mathbf{x})$. The dimensionality of the feature space is not
necessarily finite, and we will present examples in which it is infinite. The
unknown coefficients are estimated by minimizing the following functional:

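$$R[\mathbf{c}] = \frac{1}{N} \sum_{i=1}^{N} | y_i - f(\mathbf{x}_i, \mathbf{c}) |_\epsilon + \lambda \|\mathbf{c}\|^2 \tag{3}$$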

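where $\lambda$ is a constant and the following robust error function has been defined:

$$| y_i - f(\mathbf{x}_i, \mathbf{c}) |_\epsilon = \begin{cases} 0 & \text{if } | y_i - f(\mathbf{x}_i, \mathbf{c}) | < \epsilon \\ | y_i - f(\mathbf{x}_i, \mathbf{c}) | & \text{otherwise.} \end{cases} \tag{4}$$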


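Vapnik showed in [8] that the function that minimizes the functional in eq.
(3) depends on a finite number of parameters, and has the following form:

$$f(\mathbf{x}, \alpha, \alpha^*) = \sum_{i=1}^{N} (\alpha_i^* - \alpha_i) K(\mathbf{x}, \mathbf{x}_i) + b, \tag{5}$$

where $\alpha_i \alpha_i^* = 0$, $\alpha_i, \alpha_i^* \ge 0$, $i = 1, \ldots, N$, and $K(\mathbf{x}, \mathbf{y})$ is the so-called kernel
function, and describes the inner product in the D-dimensional feature space:

$$K(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{D} \phi_i(\mathbf{x}) \phi_i(\mathbf{y})$$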

The interesting fact is that for many choices of the set $\{\phi_i(\mathbf{x})\}_{i=1}^D$, including
infinite dimensional sets, the form of K is analytically known and very
simple, and the features $\phi_i$ never need to be computed in practice because
the algorithm relies only on computation of scalar products in the feature
space. Several choices for the kernel K are available, including Gaussians,
tensor product B-splines and trigonometric polynomials. The coefficients $\alpha$
and $\alpha^*$ are obtained by maximizing the following quadratic form:

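$$R(\alpha^*, \alpha) = -\epsilon \sum_{i=1}^{N} (\alpha_i^* + \alpha_i) + \sum_{i=1}^{N} y_i (\alpha_i^* - \alpha_i) - \frac{1}{2} \sum_{i,j=1}^{N} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j) K(\mathbf{x}_i, \mathbf{x}_j), \tag{6}$$

subject to the constraints $0 \le \alpha_i^*, \alpha_i \le C$ and $\sum_{i=1}^{N} (\alpha_i^* - \alpha_i) = 0$. Due to the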

nature of this quadratic programming problem, only a number of coefficients
$\alpha_i^* - \alpha_i$ will be different from zero, and the data points associated with them
are called support vectors. The parameters C and $\epsilon$ are two free parameters
of the theory, and their choice is left to the user. They both control the
VC-dimension of the approximation scheme, but in different ways. A clear
theoretical understanding is still missing and we plan to conduct experimental
work to understand their role.
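
Once the coefficients are known, evaluating the approximating function only requires kernel evaluations at the support vectors. The sketch below illustrates this with a Gaussian kernel. It is not the authors' implementation: the function names, the kernel width and the coefficient values are all hypothetical, and the coefficients are set by hand rather than obtained by solving the quadratic programming problem.

```python
import math

def gaussian_kernel(x, y, width=1.0):
    """Gaussian kernel; the width is an illustrative choice, not a value from the paper."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * width ** 2))

def svm_predict(x, points, alpha_star, alpha, b, kernel=gaussian_kernel):
    """Evaluate sum_i (alpha_i^* - alpha_i) K(x, x_i) + b, the form of eq. (5)."""
    return b + sum((a_s - a) * kernel(x, p)
                   for p, a_s, a in zip(points, alpha_star, alpha))

def support_vectors(points, alpha_star, alpha, tol=1e-12):
    """Data points whose coefficient alpha_i^* - alpha_i differs from zero."""
    return [p for p, a_s, a in zip(points, alpha_star, alpha) if abs(a_s - a) > tol]

# Hypothetical coefficients chosen by hand (alpha_i * alpha_i^* = 0 for every i).
points = [[0.0], [1.0], [2.0]]
alpha_star = [0.5, 0.0, 0.0]
alpha = [0.0, 0.0, 0.25]
value = svm_predict([0.0], points, alpha_star, alpha, b=0.1)
n_support = len(support_vectors(points, alpha_star, alpha))
```

In this toy case only two of the three points contribute to the sum, which is the sparsity property that the notion of support vectors describes.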


4  Benchmark  Time  Series 

We  tested  the  SVM  regression  technique  on  the  same  set  of  chaotic  time 
series  that  has  been  used  in  [1]  to  test  and  compare  several  approximation 
techniques. 


4.1 The Mackey-Glass time series

We considered two time series generated by the Mackey-Glass delay-differential
equation [4]:

$$\frac{dx(t)}{dt} = -0.1\,x(t) + \frac{0.2\,x(t - \Delta)}{1 + x(t - \Delta)^{10}} \tag{7}$$

with parameters $\Delta = 17, 30$ and embedding dimensions m = 4, 6 respectively.
We denote these two time series by $MG_{17}$ and $MG_{30}$. In order to be consistent
with [1] the initial condition for the above equation was x(t) = 0.9
for $0 \le t \le \Delta$, and the sampling rate $\tau = 6$. The series were generated by
numerical integration using a fourth order Runge-Kutta method.
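
A possible way to generate such a series is sketched below. It is illustrative only and not the authors' code. It assumes the commonly used form of the Mackey-Glass equation, dx/dt = −0.1 x(t) + 0.2 x(t − Δ) / (1 + x(t − Δ)^10), an integration step h = 0.1, and linear interpolation of the delayed term at the Runge-Kutta midpoints; the function name `mackey_glass` and these numerical choices are assumptions.

```python
def mackey_glass(delta=17.0, tau=6.0, n_samples=200, h=0.1, x0=0.9):
    """Sample the Mackey-Glass series every tau time units (RK4, interpolated history)."""
    lag = int(round(delta / h))      # delay expressed in integration steps
    stride = int(round(tau / h))     # integration steps between kept samples
    xs = [x0] * (lag + 1)            # constant initial history on [0, delta]

    def rate(x, x_delayed):
        return -0.1 * x + 0.2 * x_delayed / (1.0 + x_delayed ** 10)

    samples = []
    for step in range(n_samples * stride):
        x = xs[-1]
        d0 = xs[-1 - lag]            # x(t - delta)
        d1 = xs[-lag]                # x(t + h - delta)
        dm = 0.5 * (d0 + d1)         # x(t + h/2 - delta), linear interpolation
        k1 = rate(x, d0)
        k2 = rate(x + 0.5 * h * k1, dm)
        k3 = rate(x + 0.5 * h * k2, dm)
        k4 = rate(x + h * k3, d1)
        xs.append(x + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0)
        if (step + 1) % stride == 0:
            samples.append(xs[-1])
    return samples

mg17 = mackey_glass(delta=17.0, n_samples=50)
```

Choosing h so that both Δ/h and τ/h are integers keeps the delayed values aligned with stored samples, which is why h = 0.1 is used here.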


4.2  The  Ikeda  map 

The Ikeda map [2] is a two dimensional time series which is generated by iterating
the following map:

$$f(x_1, x_2) = (1 + \mu(x_1 \cos\omega - x_2 \sin\omega),\ \mu(x_1 \sin\omega + x_2 \cos\omega)), \tag{8}$$

where $\omega = 0.4 - 6.0/(1 + x_1^2 + x_2^2)$. In [1] Casdagli considered both this time
series, which we will denote by Ikeda$_1$, and the one generated by the fourth
iterate of this map, which has more complicated dynamics and will be
denoted by Ikeda$_4$.
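
The two Ikeda series can be sketched as follows. This is an illustration, not the authors' code. The value of μ is not stated in this text, so μ = 0.7 is an assumption, as are the starting point and the function names; the fourth-iterate series is obtained by keeping every fourth point of the iteration.

```python
import math

def ikeda_step(x1, x2, mu):
    """One application of the Ikeda map."""
    omega = 0.4 - 6.0 / (1.0 + x1 * x1 + x2 * x2)
    c, s = math.cos(omega), math.sin(omega)
    return 1.0 + mu * (x1 * c - x2 * s), mu * (x1 * s + x2 * c)

def ikeda_series(n, iterate=1, mu=0.7, start=(0.1, 0.1)):
    """Return n points; each point is `iterate` applications of the map after the last.

    mu = 0.7 and the starting point are assumptions made for this sketch.
    """
    x1, x2 = start
    points = []
    for _ in range(n):
        for _ in range(iterate):
            x1, x2 = ikeda_step(x1, x2, mu)
        points.append((x1, x2))
    return points

ikeda1 = ikeda_series(200)             # first iterate
ikeda4 = ikeda_series(50, iterate=4)   # fourth iterate of the same map
```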

4.3  The  Lorenz  time  series 

We also considered the time series associated with the variable x of the Lorenz
differential equation [3]:

$$\dot{x} = \sigma(y - x), \quad \dot{y} = rx - y - xz, \quad \dot{z} = xy - bz \tag{9}$$

where $\sigma = 10$, $b = 8/3$, and $r = 28$. We considered two different sampling
rates, $\tau = 0.05$ and $\tau = 0.20$, generating the two time series Lorenz$_{0.05}$ and
Lorenz$_{0.20}$. The series were generated by numerical integration using a fourth
order Runge-Kutta method.
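
A fourth order Runge-Kutta integration of the Lorenz system, sampled every τ, might look like the sketch below. It is illustrative only: the initial state (1, 1, 1), the number of integration substeps per sample, the value b = 8/3 (the classical choice) and the function name are assumptions, not details taken from the paper.

```python
def lorenz_series(tau=0.05, n_samples=200, substeps=10,
                  sigma=10.0, b=8.0 / 3.0, r=28.0, state=(1.0, 1.0, 1.0)):
    """Return the x component of the Lorenz system sampled every tau time units."""

    def deriv(s):
        x, y, z = s
        return (sigma * (y - x), r * x - y - x * z, x * y - b * z)

    def rk4(s, h):
        k1 = deriv(s)
        k2 = deriv(tuple(v + 0.5 * h * k for v, k in zip(s, k1)))
        k3 = deriv(tuple(v + 0.5 * h * k for v, k in zip(s, k2)))
        k4 = deriv(tuple(v + h * k for v, k in zip(s, k3)))
        return tuple(v + h * (a + 2.0 * p + 2.0 * q + d) / 6.0
                     for v, a, p, q, d in zip(s, k1, k2, k3, k4))

    h = tau / substeps               # internal integration step (assumption)
    xs = []
    for _ in range(n_samples):
        for _ in range(substeps):
            state = rk4(state, h)
        xs.append(state[0])
    return xs

lorenz_005 = lorenz_series(tau=0.05)
lorenz_020 = lorenz_series(tau=0.20)
```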


5  Comparison  with  Other  Techniques 

In this section we report the results of the SVM on the time series presented
above, and compare them with the results reported in [1] for different approximation
techniques (polynomial, rational, local polynomial, Radial Basis
Functions with multiquadrics as basis function, and Neural Networks). In all
cases a time series $\{x(n\tau)\}_{n=1}^{N+M}$ was generated: the first N points were used
for training and the remaining M points were used for testing. In all cases
N was set to 500, except for the Ikeda$_4$, for which N = 100, while M was
always set to 1000. The data sets we used were the same that were used in
[1]. Following [1], denoting by $f_N$ the predictor built using N data points,
the following quantity was used as a measure of the generalization error of
$f_N$:

$$\sigma^2(f_N) = \frac{1}{M} \sum_{n=N+1}^{N+M} \frac{(x(n\tau) - f_N(\mathbf{x}_{n-1}))^2}{Var} \tag{10}$$

where Var is the variance of the time series. We implemented the SVM using
MINOS 5.4 [5] as the solver for the Quadratic Programming problem of eq.
(6). Details of our implementation can be found in [6]. For each series we
chose the kernel, K, and parameters of the kernel that gave us the smallest
generalization error. This is consistent with the strategy adopted in [1]. The
results are reported in Table 1. The last column of the table contains the
results of our experiments, obtained with parameters and kernels set as described
in the remaining part of this section, while the rest of the table is from [1].

Mackey-Glass time-series: the kernel K was chosen to have the following
form:


$$K(\mathbf{x}, \mathbf{x}_i) = \sum_{d=1}^{m} \frac{\sin\left((\nu + \frac{1}{2})(x^{(d)} - x_i^{(d)})\right)}{\sin\left(0.5\,(x^{(d)} - x_i^{(d)})\right)} \tag{11}$$

where $x^{(d)}$ is the d-th component of the m-dimensional vector $\mathbf{x}$, and $\nu$ is an
integer. This kernel generates an approximating function that is an additive
trigonometric polynomial of degree $\nu$, and corresponds to features that are
trigonometric monomials up to degree $\nu$. We tried various values for $\epsilon$ and
C. The embedding dimensions for the series $MG_{17}$ and $MG_{30}$ were m = 4
and m = 6 in accordance with the work by Casdagli, and we used $\nu = 200$
and  €  =  10”^. 
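
The additive trigonometric kernel can be evaluated per component as a ratio of sines. The sketch below is illustrative and not the authors' code; the function names are hypothetical. It also checks the ratio against the equivalent cosine sum 1 + 2 Σ_{k=1}^{ν} cos(kδ), which makes explicit why the kernel corresponds to trigonometric features up to degree ν.

```python
import math

def additive_trig_kernel(x, xi, nu):
    """Sum over components of sin((nu + 1/2) * delta) / sin(delta / 2)."""
    total = 0.0
    for a, b in zip(x, xi):
        delta = a - b
        denom = math.sin(0.5 * delta)
        if abs(denom) < 1e-12:
            total += 2 * nu + 1      # limit of the ratio when delta is a multiple of 2*pi
        else:
            total += math.sin((nu + 0.5) * delta) / denom
    return total

def cosine_sum_kernel(x, xi, nu):
    """Equivalent form: sum over components of 1 + 2 * sum_k cos(k * delta)."""
    return sum(1.0 + 2.0 * sum(math.cos(k * (a - b)) for k in range(1, nu + 1))
               for a, b in zip(x, xi))

x, xi = [0.3, 1.1], [0.7, -0.4]
k_ratio = additive_trig_kernel(x, xi, nu=5)
k_cosine = cosine_sum_kernel(x, xi, nu=5)
```

The two forms agree to rounding error; the ratio form costs one evaluation per component regardless of ν, which matters for values such as ν = 200.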


Lorenz time-series: for the Lorenz$_{0.05}$ and Lorenz$_{0.20}$ series, polynomial
kernels of order 6 and 10 were used. The embedding dimensions used
were  6  and  10  respectively.  The  value  of  e  used  was  10“^. 


Ikeda map: a B-spline of order 3 was used as the kernel, and the value of $\epsilon$
was  10“^. 


          Poly        Rational    Loc d=1   Loc d=2   RBF     N.Net   SVM
MG17      -1.95(7)    -1.14(2)    -1.48     -1.89     -1.97   -2.00   -2.36 (258)
MG30      -1.40(?)    -1.33(2)    -1.24     -1.42     -1.60
Ikeda1    -5.57(12)   -8.01(?)    -1.71     -2.34     -2.95   -
Ikeda4                            -1.26     -1.60     -2.10   -       -2.31 (427)
Lor0.05                           -2.00     -3.48     -3.54   -       -4.76 (389)
Lor0.20   -1.05(?)    -1.39(?)    -1.26     -1.60     -2.10   -       -2.21 (448)

Table 1: Estimated values of $\log_{10} \sigma(f_N)$ for the SVM algorithm and for various
regression algorithms, as reported in [1]. The degrees used for the best rational and
polynomial regressors are in superscripts beside the estimates. Loc$^{d=1}$ and Loc$^{d=2}$
refer to local approximation with polynomials of degree 1 and 2 respectively. The
numbers in parentheses near the SVM estimates are the number of support vectors
obtained by the algorithm. The Neural Networks results which are missing were
also missing in [1].
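
The entries of the table are base-10 logarithms of a normalized prediction error. A minimal sketch of that measure follows; it is illustrative and not the authors' code. It assumes σ² is the mean squared one-step prediction error on the test points divided by the variance of the series; the function name, the toy data and the choice of estimating the variance from the same segment are assumptions.

```python
import math
import statistics

def log10_sigma(actual, predicted, variance):
    """log10 of sigma(f_N): RMS prediction error over the series' standard deviation."""
    m = len(actual)
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / m
    return 0.5 * math.log10(mse / variance)   # log10 of sigma, not of sigma squared

actual = [1.0, 2.0, 3.0, 4.0]                 # toy test segment, not benchmark data
variance = statistics.pvariance(actual)       # assumption: variance of this segment
offset_predictions = [v + 0.1 for v in actual]
mean_predictions = [statistics.mean(actual)] * len(actual)

score_offset = log10_sigma(actual, offset_predictions, variance)
score_mean = log10_sigma(actual, mean_predictions, variance)
```

A predictor that always outputs the mean of the series scores 0 on this scale, so more negative entries indicate better predictors.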


6 Sensitivity of SVM to Parameters and Embedding Dimension

In this section we report our observations on how the generalization error
and the number of support vectors vary with respect to the free parameters
of the SVM and to the choice of the embedding dimension. The parameters
we analyze are therefore C, $\epsilon$, the dimensionality of the feature space D, and
the embedding dimension m. All of these results are for the $MG_{17}$ series.
Figure 1a demonstrates that C has little effect on the generalization error
(the plot spans over 7 orders of magnitude). The parameter C also has little
effect on the number of support vectors, as shown in figure 1b, which remains
almost  constant  in  the  range  10“^  —  10^.  The  results  were  similar  for  kernels 
with low (D = 2), high (D = 802) and infinite dimensionality of the feature
spaces. 


Figure 1: (a) The generalization error versus C for the $MG_{17}$ series. (b)
The number of support vectors versus C for the same series. The kernel was
an additive trigonometric polynomial with $\nu = 200$.


The parameter $\epsilon$ has a strong effect on the number of support vectors and on
the generalization error, and its relevance is related to D. In order to see why
this happens, remember that if $I[\mathbf{c}]$ and $I_{emp}[\mathbf{c}]$ are respectively the expected
risk and empirical risk, then with probability $1 - \eta$:

$$I[\mathbf{c}] \le I_{emp}[\mathbf{c}] + \tau \sqrt{\frac{h\left(\log\frac{2N}{h} + 1\right) - \log\frac{\eta}{4}}{N}} \tag{12}$$

where $\tau$ is a bound on the cost function used to define the expected risk and
h is the VC-dimension of the approximation scheme. It is known that the
VC-dimension  satisfies  h  <  min('^  ,  D)  -f  1[10],  where  R  is  the  radius 

of the smallest sphere that contains all the data points in the feature space, and
A is a bound on the norm of the vector of coefficients. When D is small, the
VC-dimension h is not dependent on $\epsilon$ and the second term in the bound
on the generalization error is constant, and therefore a very small $\epsilon$ does not
cause overfitting. For the same reason, when D is large the term is very
sensitive to $\epsilon$ and overfitting occurs for small $\epsilon$. Numerical results confirm
this. For example, figures 2 and 3, which correspond to feature spaces of 802
and infinite dimensions, respectively, show overfitting. (The kernels used were
the additive trigonometric polynomial with $\nu = 200$ and a B-spline of order
3, respectively.) Figure 4 corresponds to a feature space of 10 dimensions
and there is no overfitting. (The kernel used was the additive trigonometric
polynomial with $\nu = 2$.)


Figure 2: (a) The $L_2$ generalization error versus $\epsilon$ with an 802 dimensional
feature space. The inset magnifies the boxed region in the lower left section
of the plot. Note that overfitting occurs. (b) The number of support vectors
versus $\epsilon$ for the same feature space.


Figure 3: (a) The $L_2$ generalization error versus $\epsilon$ with an infinite dimensional
feature space. The inset magnifies the boxed region in the lower left section
of the plot. Note that overfitting occurs. (b) The number of support vectors
versus $\epsilon$ for the same feature space.

Figure 4: (a) The generalization error versus $\epsilon$ with a 10 dimensional
feature space. Note that there is no overfitting. (b) The number of support
vectors versus $\epsilon$ for the same feature space.


The effect of the embedding dimension m on generalization error was also
examined. According to Takens theorem the generalization error should decrease
as m approaches the minimum embedding dimension, m*. Above m*
there should be no decrease in the generalization error. However, if the regression
algorithm is sensitive to overfitting, the generalization error can increase
for m > m*. The minimal embedding dimension of the $MG_{17}$ series is 4.
Our numerical results demonstrate that the SVM does not overfit for the case of
a low dimensional kernel and overfits slightly for high dimensional kernels,
see figure 5. Additive trigonometric polynomial kernels with $\nu = 200$ and $\nu = 2$ were
used for this figure.

7  Discussion 

The SVM algorithm showed excellent performance on the data base of chaotic
time series, outperforming the other techniques in the benchmark in all but
one case. The generalization error is not sensitive to the choice of C, and is very
stable with respect to $\epsilon$ in a wide range. The variability of the performance
with $\epsilon$ and D seems consistent with the theory of VC bounds.

Figure 5: The generalization error versus the embedding dimension. The
solid line is for an 802 dimensional feature space and the dashed line is for a
10 dimensional feature space.


Abstract

We present an adaptive metric learning vector quantization procedure based on the
discrete cosine transform (DCT) for accurate face recognition used in multimedia
applications. Since the set of learning samples may be small, we employ a
mixture model of prior distributions. The model selection method, which
minimizes the cross entropy between the real distribution and the modeled one, is
presented to optimize the mixture number and local metric parameters. The
structural risk minimization is used to facilitate an asymptotic approximation of
the cross entropy for models of fixed complexity. We also provide a formula to
estimate the model complexity derived from the minimum description length
criterion. The structural risk minimization method proposed achieves a
recognition error rate of 2.29% using the ORL database, which is better than
previously reported numbers using the Karhunen-Loeve transform convolution
network, the hidden Markov model and the eigenface model.


1.  INTRODUCTION 

In  this  paper,  we  describe  an  adaptive  metric  learning  vector  quantization  (LVQ) 
using  the  discrete  cosine  transform  (DCT)  for  face  classification  in  multimedia 
applications,  such  as  used  in  camera  and  facial  signature  recognition. 

In the past, the Karhunen-Loeve (KL) transform and principal component
analysis (PCA) have been successfully used for face feature detection [1][2]. We
employ the DCT, which has the added advantage of a computationally efficient
and data-independent matrix [3], as an alternative to the KL transform, which
requires data-dependent eigenvectors as a priori information.

Another approach, the LVQ procedure described in [4], is an
effective clustering method for a large set of training samples. However, its
performance degrades when learning from a small set of samples. A number of
promising approaches considering the local probability distribution [5] [6] [7] [8]
[9] have also been proposed. For example, Hastie presents the normal mixture
model, assuming a common variance matrix for all classes. However, when
learning from an insufficient set of samples, the assumption of the local
distribution is crucial [9], which makes the choice of the number of mixture
components extremely difficult [5].

For learning from a small set of samples, we propose an adaptive
metric LVQ based on a mixture model of local prior distributions. Our model
assigns a variable number of mixture classes to each class and then
determines the mixture number and local metric parameters to minimize the cross
entropy between the real distribution and the modeled one. In the model
selection, the minimum description length (MDL) [10] and the structural risk
minimization (SRM) principles [11] are indispensable. The structural risk
minimization of negative log-likelihood has been introduced as an asymptotic
approximation to the cross entropy minimization in cases where the model
parameters have fixed complexity. Moreover, we can account for the model
complexity $k_c$ and the data number $N_p$ using the complexity formula
$k_c (\log N_p)/N_p$ derived from the MDL criterion. It provides a good measure
of the trade-off between accuracy and complexity in determining the number of
DCT coefficients when the ratio $k_c/N_p$ is large.

The local minimization of the entropy distance is demonstrated for face
recognition on the ORL face database, which consists of 40 persons in 10
different poses with distinct variations such as open/closed eyes,
smiling/non-smiling faces, glasses/no-glasses poses, and rotation up to 20
degrees. The results are compared with those obtained using the hidden Markov
model [12], PCA, a convolution network [13] based on KLT features, and
Kohonen's self-organizing map.


2.  ADAPTIVE  METRIC  MODEL 
2.1.  The  Learning  Algorithm 

An adaptive metric LVQ estimates the Mahalanobis distance, $D(x^p, \bar{x}^k)$,
between an input vector $x^p = (x^p_1, \ldots, x^p_n)$ and a reference vector
$\bar{x}^k = (\bar{x}^k_1, \ldots, \bar{x}^k_n)$, assuming
Each component of the input vector is assumed to be independently and
identically  distributed  (i.i.d).  Therefore,  and  P(0*)  =  ‘  } 

For each presentation of an input vector in class p, we estimate the
Mahalanobis distance and update the components of the reference vectors
$\bar{x}^k$ and $\bar{x}^l$ for the first and second nearest neighbor classes
k and l, respectively, as depicted in the following equations:

■*««,  =  +  forp=fc  (2) 

=  (3) 
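
Equations (2) and (3) are not legible in this copy, so the following is only a generic sketch of an LVQ-style update with a diagonal Mahalanobis metric, not the authors' exact rule. The learning rate `alpha`, the function names, and the convention that a reference vector is attracted when its class matches the input label and repelled otherwise are all assumptions made for illustration.

```python
import numpy as np

def diag_mahalanobis(x, ref, var):
    """Squared distance weighted by a per-component variance (diagonal metric)."""
    return float(np.sum((x - ref) ** 2 / var))

def lvq_step(x, label, refs, ref_labels, var, alpha=0.05):
    """Update the two reference vectors nearest to x (generic LVQ-style rule).

    refs and var are float arrays of shape (n_classes, n_components);
    refs is modified in place. Returns the indices of the two nearest classes.
    """
    d = np.array([diag_mahalanobis(x, r, v) for r, v in zip(refs, var)])
    nearest_two = np.argsort(d)[:2]
    for idx in nearest_two:
        # Assumed convention: attract on a label match, repel on a mismatch.
        sign = 1.0 if ref_labels[idx] == label else -1.0
        refs[idx] += sign * alpha * (x - refs[idx])
    return nearest_two
```

In an adaptive metric variant the variances in `var` would also be re-estimated during learning; that step is omitted here because the source does not show it.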

2.2.  Local  Information  Measure 


To obtain optimal variances from small learning sets, we propose a
mixture model using the minimization of the cross entropy between an
unknown true distribution $P(x^p)$ and a modeled one $Q(x^p)$, which is equal
to $-\log P(x^p)$. We deal with the empirical risk function in place of the
cross entropy $\int P(x^p)\,Q(x^p)\,dx^p$ for an i.i.d. sample sequence
$x^p_1, \ldots, x^p_{N_p}$. According to the Vapnik-Chervonenkis
theory, the error is bounded by


$$P\Big\{\sup\,(\cdots) > \varepsilon\Big\} < 4\exp\Big\{-(\cdots) + h\big(\ln(2N_p/h) + 1\big)\Big\} \qquad (4)$$


Here h is the Vapnik-Chervonenkis dimension, which denotes the parametric
complexity, and $N_p$ is the sample size. We obtain asymptotically optimal
variances from the mixture distributions by increasing the size of the training
sample, even in cases where the learning parameters have a fixed complexity h.
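
The trade-off between h and $N_p$ can be checked numerically. The sketch below assumes one commonly quoted form of Vapnik's uniform convergence bound, $4\exp\{h(\ln(2N_p/h) + 1) - (\varepsilon - 1/N_p)^2 N_p\}$; the exact expression printed in equation (4) may differ, and the function name `vc_deviation_bound` is hypothetical.

```python
import math

def vc_deviation_bound(eps, n_samples, h):
    """Right-hand side of an assumed standard VC bound, clipped to 1.

    eps       : allowed deviation between true and empirical risk
    n_samples : sample size N_p
    h         : VC dimension
    """
    growth = h * (math.log(2.0 * n_samples / h) + 1.0)
    exponent = growth - (eps - 1.0 / n_samples) ** 2 * n_samples
    # A probability bound above 1 is vacuous; this also avoids overflow in exp.
    if exponent >= -math.log(4.0):
        return 1.0
    return 4.0 * math.exp(exponent)
```

For fixed h the bound is vacuous for small samples and drops quickly once $\varepsilon^2 N_p$ exceeds the complexity term, which is the sense in which larger training sets give asymptotically reliable estimates.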

Figure 1 shows a schematic illustration of the structure of the mixture classes.
Our model assumes the (q+1)-fold product of mixture distributions to derive the
asymptotically optimal variance $\sigma_{k,i}^2$ of component i in class k. Let
the (q+1)-fold mixture classes be provided with a structure:
Figure 1. A schematic structure of mixture classes defined by using a likelihood
function of mixture number j.

The j-th mixture class m(j) is assigned according to each reference class k based
on the likelihood measure. For each class, the minimal size of the
mixture model is uniquely derived from equation (5). Then, an optimal variance
is obtained by


q  +  l 


(7) 


Furthermore, we can extend this model by introducing the complexity
term $(k_c \log N_p / 2 + \log M)/N_p$ based on the minimum description length
(MDL) criterion:


(8) 


Here $k_c$ and M denote the dimension of the model and the maximum number of
models, respectively. This is equivalent to the MDL criterion in providing a
trade-off between correctness and complexity in selecting the distribution model.

2.3.  Number  of  DCT  Coefficients 


The  optimal  number  of  coefficients  in  the  DCT  approximation  of  gray  images 
can  be  derived  from  the  MDL  criterion: 


$$l = -\log P(\sigma) + \frac{k}{2}\log(N_p) + \log M, \qquad (9)$$


assuming 


(10) 


(11) 


Here k, $N_p$ and M denote the dimension of the model, the number of data samples
and the maximum number of model classes. $F(u,v)$ is the discrete cosine
transform of $f(x,y)$ and $N_{dct}$ is the number of coefficients. The maximum
likelihood DCT coefficients are obtained by minimizing the description length of
equation (9).


3.  RESULTS 


The ORL database consists of 40 persons with 10 distinct facial
expressions. The original image of 114 x 88 pixels is transformed into an image
of 16 x 16 pixels. The coefficients derived from the 16 x 16 DCT are used for
learning.


Figure 2. Relationships between the number of DCT coefficients (horizontal axis)
and the recognition error rate and mean absolute error.


First, we derive an optimal number of DCT coefficients. Figure 2
shows the relationships between the number of coefficients and the recognition
error rate and mean absolute error for the coefficient sets
$\{F^a(u,v) \mid 0 < u, v < N_{dct}\}$ and $\{F^b(u,v) \mid 0 < u + v < N_{dct}\}$.
It demonstrates the trade-off between error rate and complexity in the DCT
approximation. Increasing the number of coefficients reduces the approximation
error but increases the complexity of the learning network by $k_c (\log N_p)$.
The recognition error rates reach their minimum value at around 36 coefficients.
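
The two coefficient sets compared above, a square block and a triangular block of low-frequency terms, can be sketched as follows. This is an illustration under assumptions: the helper names are hypothetical, the DCT is the orthonormal type-II transform, and the index ranges here start at zero so that the DC term is included, which may not match the exact ranges used in the paper.

```python
import numpy as np

def dct2(image):
    """Orthonormal 2-D DCT-II of a square image via an explicit basis matrix."""
    n = image.shape[0]
    k = np.arange(n)[:, None]          # frequency index
    x = np.arange(n)[None, :]          # pixel index
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    basis[0, :] = np.sqrt(1.0 / n)     # DC row has its own normalization
    return basis @ image @ basis.T

def square_block(coeffs, n_dct):
    """Coefficients F(u, v) with u < n_dct and v < n_dct."""
    return coeffs[:n_dct, :n_dct].ravel()

def triangular_block(coeffs, n_dct):
    """Coefficients F(u, v) with u + v < n_dct."""
    u, v = np.indices(coeffs.shape)
    return coeffs[(u + v) < n_dct]
```

With a 16 x 16 input, `square_block(F, 6)` and `triangular_block(F, 8)` both return 36 coefficients, the count at which the error rate is reported to be lowest; the triangular set reaches higher frequencies along each axis for the same feature length.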

The main issue in constructing our learning model is to assume proper
input pattern distributions. For example, Hastie proposes a normal distribution
sharing a common component variance among all classes [5]. Table 1 shows the
error rate and description length obtained from the distributions with three
kinds of variances. The distribution with a common component variance,
specified by (c2) [5], is compared with a distribution with different component
variances, specified by (c3), and with that of Kohonen LVQ with
$\sigma_{k,i} = 1$ (c1). The result of the distribution (c3) is worse than that
of Kohonen LVQ with $\sigma_{k,i} = 1$. The error rate obtained from the
distribution (c2) is better than those obtained from (c1) and (c3). The
learning performance depends significantly on the distribution assumptions. We
assess the quality of the assumptions in terms of description length. The
description lengths derived from the distributions (c2) and (c3) are 1084 and
1616, respectively. The assumption used in the Hastie model attains a smaller
description length by effectively reducing the complexity term in equation (8).
The complexity terms derived from distributions (c2) and (c3) are
$(k_c/2)\log N_p + \log(k_c N_p)$ and $k_c \log N_p$. (In our simulation, the
number of classes $k_c$ and the number $N_p$ of patterns are 40 and 5,
respectively.) The distribution proposed by Hastie [5] is more probable when
the complexity term is reduced, although it is a very restrictive assumption.

Table 1. Error rate and description length obtained from the distributions with three
kinds of variances.

                      c1                   c2                  c3
Variance              $\sigma_{k,i} = 1$   common to all       class-specific
                                           classes             $\sigma_{k,i}$
Error rate [%]        12.10                9.12                14.30
Description length                         1084                1616
To reduce the first term of equation (8), our approach employs a mixture
model of normal distributions allowing a variable number of mixture classes. To
determine an optimal size, we explore the minimization of several metric measures
consisting of (m1) cross entropy, (m2) variance distance and (m3) mean value
distance, which are defined by using normal distributions of classes k and l.
Table 2 shows the error rates and average variances obtained from fourfold
mixture  distributions.  The  asymptotic  minimization  of  cross  entropy  is 
estimated  by  using  The  error  rate  obtained  from  the  cross  entropy 

minimization  is  7.68%,  which  is  better  than  8.24%  and  12.25%  obtained  from 
mean  value  distance  and  variance  distance  minimization. 

Table 2. Error rates and average variances obtained from fourfold mixture distributions.

                      m1               m2                   m3
Metric measure        cross entropy    variance distance    mean value distance
Error rate [%]        7.68             12.25                8.24
Variance              36.20            38.82                38.87

For component inputs of DCT coefficients, the entropy distance is
defined by assuming a variable number of i.i.d. mixture classes. Figure 3 shows
the relation of average variance and error rate to the number of mixture classes.
We compare the performance of our model, specified by (m4), with results
derived from a fixed size of mixture classes for all the components, specified by
(m1) and (m2). The minimal error rates of 7.65% and 8.24%, with variances of
618 and 960, are obtained at sizes of 3 and 4 for the entropy distance and the
centroid distance, respectively. Using our model, the error rate and average
variance decrease and saturate at 6.28% and 516 as the number of i.i.d.
mixture classes increases. The essence of the structural risk minimization is
that it allows a variable size of the mixture classes according to equation (5)
for i.i.d. DCT components. The error rate is significantly improved from 7.78%
to 6.28% by allowing an optimal, variable size of mixture classes defined by
the entropy distance.
Figure 3. Relation of average variance and error rate to number of mixture classes.


Table 3. Face recognition results using 5 faces for training and 5 for testing from the
ORL database.

                  DCT+AMLVQ   SOM+CN   KLT+CN   Pseudo2D HMM   Eigenfaces
Error rate [%]    2.29        3.8      5.3      5.0            10.5

Table 3 shows the face recognition results using 5 faces for training and
5 for testing from the ORL database. Figure 4 shows the recognition error rates
as a function of the number of training faces. Our method achieves an error
rate of 2.29%, which is better than the 3.8%, 5.3%, 5% and 10.5% obtained using
SOM+CN, the KLT convolution network [13], the pseudo 2D hidden Markov model
[12] and the eigenface model [13], respectively. Furthermore, the previous
results were obtained by averaging the two best trials, while our results were
obtained by averaging the best five.
Figure 4. Recognition error rates as a function of number of training faces. The
remaining faces out of the 10 per person are used for testing.


4.  CONCLUSIONS 

We  present  a  DCT-based  adaptive  metric  LVQ  employing  a  mixture  model 
designed  for  learning  from  small  sample  sets  such  as  in  face  identification. 
Adaptive  sizes  of  the  mixture  classes  and  local  metric  parameters  are  derived 
from  the  structural  empirical  risk  minimization  for  a  certain  model  complexity. 
We  extend  this  model  to  provide  a  general  trade-off  between  error  and  complexity 
by  introducing  the  complexity  term  in  MDL.  The  optimal  size  of  the  DCT 
coefficients  is  also  obtained,  achieving  the  lowest  recognition  error  rate  using  the 
ORL  database  reported  so  far. 
ABSTRACT

Speaker verification is a rapidly maturing technology that is becoming
available for commercial applications. In this paper, we investigate the
application of data fusion methods to sub-word implementations of speaker
verification. At a sub-word level, we utilize the diversity of the
information provided by the neural tree network and Gaussian mixture model
to provide a more robust sub-word model. The phrase-level scores for each
modeling approach are obtained and then combined. The data fusion method
we use for combining the model scores is the linear opinion pool. In
addition to using the diversity of the model scores, we also apply
the concept of redundancy by using a leave-one-out approach to
partition the input data. This allows us to generate several models
and accommodate the small training sample issues imposed by our
specific applications. The theoretical results of the above analysis
have been integrated into a system that has been tested with
several databases that were collected within landline and cellular
environments. These results are included in this paper. We have
found that the proper data fusion techniques will typically reduce
the error rate by a factor of two.
1  INTRODUCTION

Speaker verification consists of determining whether or not a voice sample
provides sufficient match to a claimed identity. Speaker verification has
numerous applications in areas that necessitate the validation of a person's
identity. For example, when initiating a bank account transaction over the phone
or at an automatic teller machine (ATM), speaker verification can provide
an additional level of security over personal identification numbers (PINs).
Also, speaker verification has the advantage over other forms of biometric
authentication, such as fingerprint, retinal scan, etc., in that it can be applied
over the telephone network. These are some of the characteristics that make
speaker verification a very attractive technology for numerous commercial
applications.

Speaker verification applications are generally text-independent or
text-dependent. Text-independent speaker verification systems do not require that
the same text be used for training and testing. Text-dependent speaker
verification systems require that the same text be used during both training and
testing. Though text-independent systems may be more convenient from a
user standpoint, text-dependent systems provide additional security in that
they 1) require fraudulent imposter attempts to use the same password, and
2) tend to provide better performance than text-independent systems.
Text-dependent speaker verification systems will be the focus of this paper.

In  this  paper,  we  investigate  the  application  of  data  fusion  methods  to 
sub-word  model  implementations  of  text-dependent  speaker  verification.  The 
effects  of  segmentation  for  sub-word  implementations  are  addressed.  Two 
modeling  approaches  are  then  considered  for  score  combination,  namely  the 
neural  tree  network  and  Gaussian  mixture  model. 

This  paper  is  organized  as  follows.  The  following  section  provides  an 
overview  of  the  processing  steps  in  performing  speaker  verification.  This 
overview  includes  a  brief  description  of  feature  extraction,  model  evaluation, 
and  data  fusion.  This  is  followed  by  a  description  of  the  implementation 
details  that  are  specific  to  our  system.  The  experimental  results  for  several 
text-dependent tasks are then provided. The databases used for these
experiments were collected within both landline and cellular environments. A
summary  of  the  results  is  then  given. 

2  SPEAKER  VERIFICATION 

Speaker  verification  generally  consists  of  feature  extraction  followed  by  model 
construction  and  evaluation.  As  part  of  model  construction  and  evaluation, 
we  will  also  address  the  concept  of  data  fusion  where  the  scores  of  several 
models  are  combined  to  create  a  composite  score.  This  composite  score  will 
be  that  which  is  applied  to  a  threshold  to  yield  the  final  decision.  These  phases 
of  speaker  verification  are  briefly  described  in  the  following  subsections. 




2.1  Feature  Extraction 


Feature  extraction  consists  of  deriving  characteristics  of  the  speech  signal 
that  are  unique  to  an  individual.  The  predominant  characteristic  that  causes 
people’s  voices  to  be  different  from  one  another  is  the  shape  of  the  vocal  tract. 
The  difference  in  the  length  and  cross-sectional  areas  in  the  vocal  tract  from 
person  to  person  results  in  different  resonant  frequencies  and  bandwidths. 
Hence,  most  feature  extraction  routines  for  speaker  recognition  utilize  some 
type of spectral analysis. Typical features are the cepstrum or variants of
it. The pole-filtered, mean-removed cepstrum [1] is the feature set used in the
experimental results section. For this feature set we first obtain a channel
estimate by computing the pole-filtered mean of the linear predictive (LP)
cepstrum of the input speech. This channel estimate is converted to a filter
that is applied to the speech to invert the channel effect. Then, the LP
cepstrum of the filtered speech is used as the feature.

2.2  Modeling 

A  speaker  verification  model  is  constructed  from  feature  data,  specifically 
that  from  a  target  speaker  and  possibly  from  non-target  speakers.  This  model 
should  have  the  ability  to  provide  a  level  of  match  to  the  target  speaker  when 
given  a  new  set  of  feature  data.  For  text-dependent  speaker  verification,  a 
model  should  capture  the  temporal  information  in  addition  to  the  acoustical 
information.  The  standard  models  that  accomplish  this  are  hidden  Markov 
models (HMMs) and dynamic time warping (DTW). In general, segment-based
approaches to speaker verification maintain temporal information. Another
important piece of information for model construction or evaluation is
data that is not from the target speaker, or “non-target” data. One method
for incorporating this information is used during model evaluation and is
known as cohort normalization [2]. Another method is to use non-target data
during training, which can be accomplished by using discriminative training
approaches [3] or neural networks [4].

The modeling approach here is based on the neural tree network (NTN)
and Gaussian mixture model (GMM). The NTN [5] is a hierarchical classifier
that uses a tree architecture to implement a sequential linear decision
strategy. The NTN has been evaluated for text-independent speaker verification
[4], whole-word based, text-dependent speaker verification [6], and sub-word
based, text-dependent speaker verification [7, 8]. Data fusion methods were
considered for whole-word NTN models with dynamic time warping [6, 9].
In this paper, we evaluate data fusion methods for sub-word NTN models
combined with Gaussian mixture modeling, which is also a popular model for
speaker verification [10].
2.3  Data Fusion

Data fusion methods can take advantage of the concepts of diversity and
redundancy to improve system performance. Diversity can be used to improve
system performance through the incorporation of different information.
Similarly, redundancy can achieve the same goals through the re-use of data.
These concepts have been thoroughly explored in the field of communications
and have also been applied to pattern recognition problems. The basic idea
is that if several models can be constructed, whose errors are mutually
uncorrelated, then performance advantages can be obtained through the proper
combination of the model scores.

The combination of different sources of information has been explored
within a field known as data fusion. A comparison was done between several
data fusion techniques, including the linear and log opinion pools [11], and
voting [12] for a speaker verification application [6]. This comparison showed
the simplest method, namely the linear opinion pool, to do at least as well
as the other methods. Hence, the linear opinion pool will be considered here.
The linear opinion pool is evaluated as a weighted sum of the outputs for
each model:

$$P_{linear}(x) = \sum_{i=1}^{n} \alpha_i \, p_i(x)$$

where $P_{linear}(x)$ is the probability of the combined system, $\alpha_i$ are
weights, $p_i(x)$ is the probability output by the i-th model, and n is the
number of models. For all experiments in this paper, $\alpha_i$ is between zero
and one and the sum of the $\alpha_i$'s is equal to one.
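
A minimal sketch of the linear opinion pool as a weighted sum follows. The function name and the array layout (one row of scores per model) are assumptions for illustration; the weight constraints mirror those stated for the experiments.

```python
import numpy as np

def linear_opinion_pool(probs, weights):
    """Weighted sum of model outputs.

    probs   : shape (n_models,) or (n_models, n_trials), one row per model
    weights : shape (n_models,), each in [0, 1], summing to one
    """
    probs = np.asarray(probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if probs.shape[0] != weights.shape[0]:
        raise ValueError("need exactly one weight per model")
    if np.any(weights < 0) or np.any(weights > 1) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must lie in [0, 1] and sum to one")
    # Contract over the model axis: sum_i weights[i] * probs[i].
    return np.tensordot(weights, probs, axes=1)
```

With two models the weights reduce to a single parameter, alpha and 1 - alpha, so the fused score can be swept over one scalar.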

3  SPEAKER  VERIFICATION  SYSTEM 

The speaker verification system used in this paper is known as the T-NETIX
SpeakEZ Voice Print™ system. This system is text dependent and utilizes
sub-word NTN and GMM models, along with vocabulary-independent password
selection and data fusion. The vocabulary-independent password selection
is enabled through a technique known as blind segmentation [13]. The
blind segmentation algorithm will automatically determine the number of
segments and segment boundaries for a password without the use of transcription
information. The NTN and GMM scores for each subword are accumulated
to form the phrase-level score for each model type.

Additionally, a leave-one-out strategy is deployed to utilize the data
redundancy in addition to facilitating threshold selection. Basically, for N
enrollment repetitions of a password, there will be N separate models. Each
model is trained with N - 1 repetitions, with a different repetition left out
for each model. The left-out repetition can then be applied to the model
to yield an unbiased target speaker score that can be used in setting the
threshold for speaker acceptance/rejection.
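
The leave-one-out partitioning described above can be sketched as follows. The callables `train_fn` and `score_fn` are hypothetical stand-ins for model training and scoring, which the paper does not specify at this level.

```python
def leave_one_out_splits(repetitions):
    """Yield (training_subset, held_out) pairs, one per enrollment repetition."""
    for i, held_out in enumerate(repetitions):
        yield repetitions[:i] + repetitions[i + 1:], held_out

def held_out_scores(repetitions, train_fn, score_fn):
    """Train N models on N - 1 repetitions each and score the left-out one.

    train_fn(list_of_repetitions) -> model   (hypothetical)
    score_fn(model, repetition)   -> float   (hypothetical)
    """
    return [score_fn(train_fn(train), held)
            for train, held in leave_one_out_splits(repetitions)]
```

The returned scores come from repetitions that the corresponding model never saw, which is what makes them usable for setting the acceptance threshold.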
Figure 1: Training a speaker model


The procedure to train a model for a given speaker is illustrated in Figure
1. The multiple repetitions of the speaker's password are used by the
segmentation module to estimate the number of subwords in the password along
with the subword boundaries. The mean vector and diagonal covariance
matrix of the subword segments are obtained as by-products of the segmentation
module. These are used as the GMM component of the speaker model. For
each subword segment of the password, an NTN model is also trained. The
closest subword segments from other speakers who are already enrolled in the
database are used as non-target data for training these subword NTN models.

The procedure to verify a claimed identity is illustrated in Figure 2. The
given testing utterance is segmented to the optimal number of segments
determined during training. The subword segment vectors are scored using
the appropriate subword NTN and GMM models. The scores of these subword
segments are averaged and a composite score for the entire phrase is
obtained. The phrase-level NTN and GMM scores are then fused together
using the linear opinion pool. We have performed experiments that did not
show any advantages by combining this information at the subword level. If
multiple models are obtained during training using the leave-one-out method,
then all these models are scored in the above manner. These model scores
are averaged to yield the final output score.
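
The scoring steps above (average the subword scores, fuse NTN and GMM at the phrase level, then average over the leave-one-out models) can be sketched as follows. The function names and the choice to attach `alpha` to the NTN score are assumptions; the paper does not state which model receives which weight.

```python
def phrase_score(subword_scores):
    """Average the subword-level scores into one phrase-level score."""
    return sum(subword_scores) / len(subword_scores)

def verification_score(models_scores, alpha=0.5):
    """Final output score for one test utterance.

    models_scores : list with one (ntn_subword_scores, gmm_subword_scores)
                    pair per leave-one-out model
    alpha         : weight on the NTN phrase score (assumed convention);
                    the GMM phrase score gets 1 - alpha
    """
    fused = [alpha * phrase_score(ntn) + (1.0 - alpha) * phrase_score(gmm)
             for ntn, gmm in models_scores]
    return sum(fused) / len(fused)
```

The result would then be compared with the speaker's threshold to accept or reject the claimed identity.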
Figure 2: Testing a claimed identity

4  EXPERIMENTAL  RESULTS 

The T-NETIX SpeakEZ Voice Print™ system is evaluated with three toll
quality speech corpora that were collected by T-NETIX. The first database
is known as the “names” database. The names database consists of 10 male
target speakers, each with three enrollment utterances of their full name. The
imposter attempts come from the remaining nine speakers and all use
the correct password. The second database is known as the “open sesame”
database. This database consists of 56 enrolled speakers and 47 separate
non-target speakers. Each speaker enrolled with the phrase “open sesame”; hence,
this scenario reflects a fixed-text situation. The third database is known as
the “cellular” database. This database is also a fixed-text application that
uses the password “Al Capone” for all speakers. This database was collected
using cellular phones and consists of 26 evaluation speakers and 15 non-target
speakers. The aspects of each database are summarized in Table 1.
The non-target speakers column in Table 1 refers to the development set that
is used during training of a speaker model. To avoid bias in the results, the
development speakers are not used as imposters during the actual testing.
The evaluation speakers are used to measure the actual system performance.

The first experiment evaluates the system equal error rate as a function
of the number of segments. Generally, the system computes the number
of segments per password, but in this case, we have forced the number of
segments to be constant for all speakers. The results of this experiment as
performed on the names database are shown in Figure 3. It is clear from
Figure 3 that the GMM requires several segments before the performance
starts to become competitive. The performance of the NTN, however, starts
to degrade as the number of segments increases beyond four or five segments.
This is due to the fact that the number of data samples per NTN decreases as
the number of segments increases. Hence, for the NTN the lack of data starts
to outweigh the benefits of decomposing the acoustic space of the password.

Figure 3: EER versus number of segments

The  next  experiment  evaluates  the  equal  error  rate  as  a  function  of  alpha 
for  the  linear  opinion  pool  method  of  data  fusion.  The  system  uses  a  variable 
number  of  segments  per  speaker.  The  results  of  this  experiment  for  the  names 
database  are  shown  in  Figure  4.  Here,  it  can  be  seen  that  the  individual 
performance  of  the  GMM  and  NTN  is  3.2%  and  3.4%,  respectively.  However, 
by  combining  the  results  of  these  methods,  the  EER  can  be  reduced  to  1.6%. 
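
The equal error rate used throughout these experiments is the operating point where the false acceptance and false rejection rates coincide. The sketch below is one simple way to estimate it from two score lists; the function name, the threshold search over observed scores, and the convention that higher scores favor the target speaker are assumptions for illustration.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER by scanning thresholds taken from the observed scores."""
    target = np.asarray(target_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.unique(np.concatenate([target, impostor]))
    best_gap, eer = np.inf, None
    for t in thresholds:
        frr = np.mean(target < t)        # true speakers rejected at threshold t
        far = np.mean(impostor >= t)     # impostors accepted at threshold t
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return float(eer)
```

Sweeping the fusion weight then amounts to computing fused scores for each alpha and calling this function once per value.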

This experiment was also evaluated with the “Open Sesame” and “cellular”
databases and the results for these experiments are shown in Figures 5
and 6, respectively. The results for the “Open Sesame” database show the
individual performance of the NTN and GMM to be 1.6% and 2.3%,
respectively, whereas the performance of the fused output is 0.9%. For the
cellular “Al Capone” database the individual performance of the NTN and GMM is
11.8% and 10.2%, respectively, while the performance of the fused output is
8.2%.

The experimental results for T-NETIX's SpeakEZ Voice Print™ system
are tabulated for the “names”, “Open Sesame” and “cellular” databases in
Table 1. The results in this table reflect the fusion results when α = 0.5.
Figure 6: Linear opinion pool for “Cellular” database (“Al Capone” password;
equal error rate versus alpha)


5  CONCLUSION 

The T-NETIX SpeakEZ Voice Print™ system is evaluated for several
text-dependent speaker verification tasks. These include applications in both
cellular and landline environments. The system does not have any constraints
on the vocabulary from which the password is selected. This is accomplished
through the use of sub-word neural tree networks and a blind segmentation
algorithm that does not require phonetic label information. In addition, the
system uses data fusion to capitalize upon different modeling approaches
whose errors are uncorrelated. The data fusion techniques are found to reduce
the error rate by a factor of two for the landline databases. The error rate
for the cellular database is reduced by 20%. The error rates for the landline
and cellular databases are roughly 1-2% and 8%, respectively. We find these
results very encouraging given the constraints of limited training
repetitions, short enrollment utterances, and unconstrained vocabulary for
password selection.

Password text    # development/         # true/imposter    Performance
                 evaluation speakers    trials             (EER)
[illegible]      47/56                  195/11,229         [illegible]
Own full name    80/10 males            100/450            [illegible]
“Al Capone”      15/26                  273/6825           8.2%

Table 1: Performance for the SpeakEZ Voice Print™ system
Abstract - A chaotic annealing neural network model based on transient chaos
and dynamic gain is proposed in this article for solving optimization problems
with continuous variables, such as maximum likelihood estimation of spatial
signal sources. Compared with conventional neural networks that have only
point attractors, the proposed neural network has richer and more flexible
dynamics, which are expected to give it a higher ability to search for globally
optimal or near-optimal solutions. After going through an inverse-bifurcation
process, the neural network gradually approaches a conventional Hopfield
neural network starting from a good initial state. Numerical simulations show
both the effectiveness of the proposed network in escaping from local minima
and its ability to solve the nonlinear maximum likelihood estimation of
spatial sources.


I. INTRODUCTION

In many branches of science and technology one often encounters difficult
optimization problems with intractable computational complexity. For these
problems there is a large but finite set of possible solutions, among which we
wish to find the one that globally minimizes the cost function involved.
Typical examples are the knapsack problem and the traveling salesman problem
(TSP). Since the pioneering work in [2], the collective computational
properties of the Hopfield neural network (HNN) in seeking a stable equilibrium
have been used to solve many difficult optimization problems. The main
difficulty in solving practical optimization problems with the HNN is that the
network tends to become trapped in local minima because of its gradient descent
dynamics. To avoid getting stuck in local minima, both stochastic simulated
annealing (SSA) approaches and deterministic simulated annealing (DSA)
techniques have been proposed and combined with neural networks. Typical
examples of neural networks with an SSA function are Boltzmann machines and
Gaussian machines. On the other hand, various DSA approaches, such as hardware
annealing (namely gain sharpening) in Hopfield networks and cellular neural
networks, and mean field approximate annealing, have been proposed and applied
to neural networks.

Recently, a number of artificial neural networks with chaotic dynamics have
been extensively investigated because of their more complex neurodynamics.
Unlike conventional networks that use only gradient descent dynamics, neural
networks with chaotic dynamics have richer, far-from-equilibrium dynamics with
various coexisting attractors: not only fixed points and periodic points but
also strange attractors. This kind of complicated neurodynamics is promising
for information processing and optimization. In particular, the ability of a
chaotic neural network to move chaotically over a fractal structure in the
phase space may provide an efficient heuristic for searching for a globally
optimal or near-optimal point while avoiding local minima. The main difficulty
in exploiting chaotic characteristics is deciding when to terminate the
chaotic dynamics, or how to harness the chaotic behavior for convergence to a
stable equilibrium point corresponding to an acceptably near-optimal state.

In order to make full use of the advantages of both chaotic neurodynamics and
conventional gradient descent (or convergent) neurodynamics, this article
proposes a neural network model with transient chaos and dynamic gain for
solving nonlinear optimization problems. This neural network is expected to
have a higher ability to search for the globally optimal or a near-optimal
solution. The characteristics of the proposed network are analyzed and
examined in detail by numerical simulations.


II. A NEURAL NETWORK WITH TRANSIENT CHAOS AND TIME-VARIANT GAIN

It is well known that Hopfield networks with continuous-time or asynchronous
discrete-time state transitions guarantee convergence to a stable equilibrium
solution but suffer from local minimum problems. The chaotic neural network,
in contrast, has richer and more flexible neurodynamics whose running region
is only a fractal structure in the phase space, and it may be used to escape
efficiently from local minima through its chaotic motion. Therefore, in order
to obtain a globally optimal or near-optimal convergent solution for nonlinear
optimization, we combine the chaotic dynamics with convergent dynamics and
propose a new chaotic neural network model based on transient chaos and
time-variant gain (NNTCTG), as defined below:


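$$y_i(t+1) = k\,y_i(t) + \alpha\Big(\sum_{j} w_{ij}\,x_j(t) + I_i\Big) - z_i(t)\big(x_i(t) - I_0\big) \tag{2}$$

$$z_i(t+1) = (1-\beta)\,z_i(t) \tag{3}$$

$$\varepsilon_i(t+1) = (1-\gamma)\,\varepsilon_i(t) \tag{4}$$

$$(i = 1, 2, \ldots, n)$$
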

where $x_i$, $y_i$ and $I_i$ are the output, internal state and input bias of
neuron $i$; $w_{ij}$ = connection weight from neuron $j$ to neuron $i$;
$\alpha$ = positive scaling parameter for inputs; $k$ = damping factor of the
nerve membrane ($0 \le k \le 1$); $z_i(t)$ = self-feedback connection weight
($z_i(t) > 0$); $\beta$ = damping factor of the time-dependent $z_i(t)$
($0 \le \beta \le 1$); $\varepsilon_i(t)$ = gain parameter of the output
function ($\varepsilon_i(t) > 0$); $\gamma$ = damping factor of the
time-dependent $\varepsilon_i(t)$ ($0 \le \gamma \le 1$).

In (1)-(4), if $\varepsilon_i(t)$ equals a large positive constant
$\varepsilon$, $z_i(t)$ is a positive constant $z$, and $I_0 = 0$, then the
NNTCTG reduces to the chaotic neural network (CNN) proposed in [7]. The NNTCTG
can therefore be regarded as a generalized CNN and is expected to show similar
chaotic phenomena with complicated bifurcation structures. By introducing the
time-dependent variables $z_i(t)$ and $\varepsilon_i(t)$, the chaotic dynamics
of the NNTCTG can be reasonably harnessed, and a highly accurate steady-state
solution of a nonlinear optimization problem can be obtained as long as the
parameters of the NNTCTG are chosen appropriately. This is shown numerically
in the next section.

The term $z_i(t)(x_i(t) - I_0)$ in (2) is related to inhibitory self-feedback
or refractoriness and is the main factor generating chaotic phenomena. It can
be shown that the NNTCTG has transiently chaotic dynamics that eventually
converge to a stable equilibrium point through successive bifurcations, like a
route of reversed period-doubling bifurcations, as $z_i(t)$ and
$\varepsilon_i(t)$ evolve according to (3) and (4). The variables $z_i(t)$ and
$\varepsilon_i(t)$ correspond to the temperature in the usual stochastic
simulated annealing process with an exponential cooling schedule; they harness
the chaotic behavior for convergence and set the speed of the reversed
bifurcation. In effect, the damping of $z_i(t)$ and $\varepsilon_i(t)$
produces successive bifurcations so that the neurodynamics eventually converge
from strange attractors to a stable equilibrium point.

Compared with the HNN, the NNTCTG has an additional nonlinear time-dependent
damping term. As the self-feedback connection weights $z_i(t)$ tend toward zero
with time, in the form $z_i(t) = z_i(0)(1-\beta)^t$ implied by (3), the NNTCTG
defined in (1)-(4) eventually reduces to the continuous-time HNN, which has
been used extensively to solve many difficult optimization problems in many
domains. At the same time, with the temporal evolution of the gain parameter
$\varepsilon_i(t)$, the convergent NNTCTG, after a reversed period-doubling
bifurcation process, gradually approaches an HNN with the desired gain value,
which is very suitable for accurately solving nonlinear optimization problems
with continuous variables.
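
The update rules (2)-(4) can be sketched in a few lines of code. This is only an illustration under stated assumptions: the neuron output function is not legible in the text, so a logistic sigmoid with gain $\varepsilon_i(t)$ is assumed here, and the weight matrix `W`, bias `I` and the helper names are hypothetical.

```python
import numpy as np

def sigmoid_output(y, eps):
    # Assumed output function: logistic sigmoid with gain eps (not given in the text).
    return 1.0 / (1.0 + np.exp(-np.clip(eps * y, -500.0, 500.0)))

def nntctg_step(y, z, eps, W, I, k, alpha, beta, gamma, I0):
    """One synchronous update following equations (2)-(4)."""
    x = sigmoid_output(y, eps)
    y_next = k * y + alpha * (W @ x + I) - z * (x - I0)  # eq. (2)
    z_next = (1.0 - beta) * z                            # eq. (3): self-feedback decays
    eps_next = (1.0 - gamma) * eps                       # eq. (4): gain decays
    return x, y_next, z_next, eps_next

def run_nntctg(y0, z0, eps0, W, I, steps,
               k=1.0, alpha=0.015, beta=0.01, gamma=0.01, I0=0.5):
    """Iterate the network and return the output history and final z, eps."""
    y, z, eps = (np.array(v, dtype=float) for v in (y0, z0, eps0))
    outputs = []
    for _ in range(steps):
        x, y, z, eps = nntctg_step(y, z, eps, W, I, k, alpha, beta, gamma, I0)
        outputs.append(x)
    return np.array(outputs), z, eps
```

Because (3) and (4) are plain geometric decays, the self-feedback and gain after `steps` iterations are their initial values times $(1-\beta)^{steps}$ and $(1-\gamma)^{steps}$, which is what lets the chaotic phase fade into ordinary convergent dynamics.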

From the above discussion, the procedure by which the NNTCTG solves a
nonlinear optimization problem can be divided into two phases: a chaotic
bifurcation phase and a gradient convergent phase. In the first phase, a
complicated and rich chaotic bifurcation process is created by large values of
refractoriness and gain so that the network can escape from local minima; this
mechanism can be regarded as a kind of DSA and is called chaotic simulated
annealing (CSA). After that, a good initial state in a neighborhood of the
globally optimal solution is provided for the gradient descent dynamics of the
second phase, so that the network can easily reach the globally optimal or a
near-optimal solution of the problem.


III. NEURODYNAMICS ANALYSIS OF NNTCTG

In  order  to  analyze  and  examine  the  nonlinear  dynamics  of  the  proposed  NNTCTG, 
we  use  it  to  solve  a  concrete  optimization  problem  whose  objective  function  is  defined 
as 


$$E(x_1, x_2) = (x_1 - 0.7)^2\big((x_2 + 0.6)^2 + 0.1\big) + (x_2 - 0.5)^2\big((x_1 + 0.4)^2 + 0.15\big). \tag{5}$$

Point  (0.7, 0.5)  is  the  global  minimum  while  the  points  (0.6, 0.4),  (0.6, 0.5)  and  (0.7, 
0.4)  are  local  minima  in  the  landscape  of  energy  function  of  (5). 
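
For reference, the objective can be evaluated numerically. The sketch below reads (5) as $E(x_1, x_2) = (x_1 - 0.7)^2((x_2 + 0.6)^2 + 0.1) + (x_2 - 0.5)^2((x_1 + 0.4)^2 + 0.15)$; the function names are hypothetical, and the gradient is derived by hand from that expression.

```python
import numpy as np

def energy(x1, x2):
    """Objective function (5)."""
    return ((x1 - 0.7) ** 2 * ((x2 + 0.6) ** 2 + 0.1)
            + (x2 - 0.5) ** 2 * ((x1 + 0.4) ** 2 + 0.15))

def energy_grad(x1, x2):
    """Analytic gradient of (5) with respect to (x1, x2)."""
    dE_dx1 = (2.0 * (x1 - 0.7) * ((x2 + 0.6) ** 2 + 0.1)
              + 2.0 * (x2 - 0.5) ** 2 * (x1 + 0.4))
    dE_dx2 = (2.0 * (x1 - 0.7) ** 2 * (x2 + 0.6)
              + 2.0 * (x2 - 0.5) * ((x1 + 0.4) ** 2 + 0.15))
    return np.array([dE_dx1, dE_dx2])
```

Both terms of (5) are non-negative and vanish together only at (0.7, 0.5), so the energy there is exactly zero and the gradient is zero as well.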

Let us set the values of the network parameters in (1)-(4) as

$$k = 1.0;\quad \varepsilon(0) = [230,\ 230];\quad I_0 = 0.5;\quad z(0) = [0.082,\ 0.082]. \tag{6}$$

By letting $\sum_j w_{ij} x_j + I_i = -\partial E/\partial x_i$ $(i = 1, 2)$,
the objective function of the concrete optimization problem can be transformed
into the energy function of the corresponding NNTCTG. Through this
transformation the proposed NNTCTG can be formed to solve this kind of
optimization problem and can evolve from a given initial state.

Fig. 1 Time evolutions of x(t) and the Lyapunov exponent λ in the dynamics of the NNTCTG (horizontal axes: iteration times)

Fig. 1 shows the time evolutions of $x_1(t)$ and $x_2(t)$ and their Lyapunov
exponents $\lambda$, which are calculated by (1)-(4) with $\beta$ = 0.01,
$\gamma$ = 0.01 and $\alpha$ = 0.015 from the given initial state
y(0) = [-0.2828, -0.0461].

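The Lyapunov exponent $\lambda_i$ for $x_1(t)$ and $x_2(t)$ is calculated by

$$\lambda_i = \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} \ln \left| \frac{dx_i(k+1)}{dx_i(k)} \right|, \qquad i = 1, 2. \tag{7}$$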
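
The averaging in (7) can be illustrated on a simpler system. The sketch below is not the NNTCTG: it uses the one-dimensional logistic map $x(k+1) = r\,x(k)(1 - x(k))$ as a stand-in, because its derivative is known in closed form; the function name and default parameters are hypothetical.

```python
import math

def lyapunov_logistic(r, x0=0.2, n=20000, burn_in=1000):
    """Estimate the Lyapunov exponent of the logistic map by averaging ln|dx(k+1)/dx(k)|."""
    x = x0
    for _ in range(burn_in):          # discard the initial transient
        x = r * x * (1.0 - x)
    total = 0.0
    for _ in range(n):
        deriv = abs(r * (1.0 - 2.0 * x))         # |dx(k+1)/dx(k)| for this map
        total += math.log(max(deriv, 1e-300))    # guard against log(0)
        x = r * x * (1.0 - x)
    return total / n
```

A positive estimate indicates chaotic motion and a negative one indicates convergence to a stable orbit, which is the sign test applied to the curves in Fig. 1.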


The positive values of $\lambda$ in Fig. 1 indicate that the NNTCTG is indeed
chaotic during the first 110 iterations for $x_1(t)$ and $x_2(t)$ in this
problem. After that, the network enters its gradient convergent stage, since
the Lyapunov exponents are negative. As $z_i(t)$ and $\varepsilon_i(t)$ are
damped simultaneously on an exponential schedule, Fig. 1 shows clearly that
the neuron outputs $x_1(t)$ and $x_2(t)$ gradually pass from chaotic behavior
to fixed (steady) values through a reversed period-doubling bifurcation
process. In other words, the proposed NNTCTG has transiently chaotic dynamics
and almost coincides with the HNN's dynamics once the time-dependent $z_i(t)$
and $\varepsilon_i(t)$ have decreased enough. This supports the discussion in
the previous section. The fixed output of the network after 185 iterations is
the global minimizer (0.70, 0.50).

Fig. 2 Multiple minima in the energy function (5) for the two-neuron NNTCTG.
(a)(b) Energy contours and trajectories of the network outputs from the
initial states (0.5631, -0.3461) and (0.0337, 0.8453), respectively.
(c)(d) Corresponding energy functions E(t) during network evolution for the
two cases.

Fig. 2 shows contour plots of the energy function and trajectories of the
output values of the NNTCTG when the network parameters are set as above
except for the initial states (0.5631, -0.3461) and (0.0337, 0.8453). The
outputs initially located at a = (-1.0, -1.0) and b = (0.9992, 1.0) follow
their individual trajectories as the network evolves. After a short
bifurcation process, the steady-state outputs of the network for locations a
and b are both (0.7, 0.5), which corresponds to the global minimum. The first
stage of each trajectory is a temporal process of reversed period-doubling
bifurcations. Its search region in the state space, shown by the points in
Fig. 2(a) and (b), could in principle lie anywhere in the state space in an
unpredictable or chaotic manner, but is in fact restricted to a small,
possibly fractal structure whose volume should be zero with respect to the
Lebesgue measure of the whole state space. CSA is therefore much more
efficient computationally than SSA, which needs to search the whole state
space. Fig. 2(c) and (d) show that the energy functions corresponding to the
two trajectories also follow the same two phases: a fixed value (the global
minimum) is approached in a gradient descent manner after the reversed
bifurcations.

When we use the HNN to solve the problem under the same conditions, the
initial locations a and b lead to a local minimizer in the steady-state
outputs; the corresponding trajectories are shown in Fig. 2 by dashed lines.
This comparison shows clearly that the proposed NNTCTG, with its intrinsic CSA
mechanism, has a much stronger ability than the conventional HNN to search for
a globally optimal or near-optimal solution of a nonlinear optimization
problem.

In the sequel we vary $\beta$ and $\gamma$ to investigate the dynamics of the
NNTCTG while the other parameters are fixed as in (6).

Fig. 3 shows the time evolution of the neuron output $x_1(t)$ for different
values of the damping factors $\beta$ and $\gamma$. As shown in Fig. 3(a) and
(c), with $\gamma$ = 0.001 or 0.005 and $\beta$ = 0.01, the parameter $\gamma$
controls the bifurcation speed of the transient chaos and the accuracy of the
steady-state solutions. On the other hand, as shown in Fig. 3(b) and (d), when
$\gamma$ = 0 and $\beta$ = 0.01 or 0.1, after a short bifurcation the output
$x_1(t)$ becomes an oscillation and cannot be stabilized at a fixed point, at
least within the 2000 iterations run for this problem. In other words, the
NNTCTG without time-variant gain cannot be used to solve nonlinear
optimization problems with continuous variables. Therefore, the damping factor
$\gamma$ of the time-variant gain is a key parameter of the NNTCTG: it not
only governs the bifurcation speed of the transient chaos and controls the
solving accuracy of the network, but also trades off the gain requirements of
the transient chaos and the steady-state HNN. In addition, the coefficient
$\alpha$ balances the chaotic dynamics contributed by the time-dependent term
in (2) against the convergent dynamics contributed by the gradient term of the
energy function in (2), and should be chosen appropriately in practical
applications.

Fig. 3 Trajectories of neuron $x_1$ with different values of the damping factors $\beta$ and $\gamma$

In order to further examine the ability to obtain globally optimal and
near-optimal solutions of complex nonlinear optimization problems, we solve
(5) from 10000 different initial states and list the results in Table 1. For
comparison, Table 1 also gives the statistical results of the HNN under the
same conditions. The NNTCTG obtains the globally optimal solution every time,
whereas the HNN obtains it only 238 times because of its greedy gradient
descent. The average number of iterations for convergence of the NNTCTG is
195, about one quarter of that of the HNN.

TABLE 1 RESULTS OF 10000 DIFFERENT INITIAL CONDITIONS ON THE OPTIMIZATION
PROBLEM AS IN (6) BY NNTCTG AND HNN

Neural network model                  NNTCTG          HNN
Rate of global minima (%)             10000 (100%)    238 (2.38%)
Rate of local minima (%)              0 (0%)          9762 (97.62%)
Average iterations for convergence    195             749

IV. APPLICATION TO DOA ESTIMATION OF SPATIAL SOURCES

Recently, highly accurate direction of arrival (DOA) estimation of spatial
signal sources has been studied extensively and finds applications in radar,
communications, sonar, geophysical imaging and so on. Typical techniques are
maximum likelihood (ML), MUSIC, minimum variance, propagating operator and
ESPRIT. Although the ML method provides an optimum solution, it is not as
widely used as the so-called suboptimal methods, mainly because of the high
computational complexity of optimizing a nonlinear likelihood function
expressed as


[Equation (8): the nonlinear ML likelihood function to be minimized; not legible in the source.]


where $d$ is the interval between array units, $\lambda$ is the signal
frequency, $P$ denotes the number of signal sources, $N$ is the number of
array units, and $M$ is the number of flash shots. $X_j(l)$ represents the
received data of the $j$th unit in the $l$th flash shot. Variable $\theta_i$
is the direction of the $i$th signal source, $\varphi_i$ is the initial phase
and $s_i(l)$ is the amplitude of the signal sources.


Fig. 4 Trajectories of the network for estimating three signal sources (horizontal axis: iterations)

When the massively parallel computational ability of Hopfield-like neural
networks is applied to this ML estimation in real time, the estimation suffers
from local minima problems. Here we use the proposed NNTCTG to solve this
problem. Through an appropriate transformation, equation (8) can be
transformed into the energy function of the NNTCTG, which can then be used to
solve the ML DOA estimation at hand.

Example: DOA estimation for three narrow-passband signal sources whose
incoming angles are 21°, 32° and 50°, respectively. The SNR is 20 dB and
N = 5. The network parameters are chosen as k = 1.0; ε(0) = 280; I_0 = 0.5;
z(0) = 0.082; β = 0.001; γ = 0.04; α = 0.009. Fig. 4 shows the trajectories of
the proposed network for the three spatial signal sources. It can be seen that
the globally optimal solution, which is hard to obtain with conventional
HNN-like methods, is easily reached by our neural network.


V. CONCLUSIONS

In this paper, we proposed a chaotic annealing neural network for accurately
solving nonlinear optimization problems with continuous variables and applied
it to ML estimation of spatial sources. It combines chaotic dynamics with
convergent dynamics, and its intrinsic CSA mechanism gives it a much higher
ability to search for globally optimal or near-optimal solutions than the
conventional HNN and a higher computational efficiency than SSA. Numerical
results have been given to examine and demonstrate the merit of the proposed
neural network.

ABSTRACT

In this work, we propose a neural network equalizer with a fuzzy
decision learning rule based on the generalized probabilistic descent
algorithm with the minimum decision error formulation. The neural
network used is the multi-layer perceptron. It is shown that the problem of
decision regions overlapped by noise can be overcome by the use of a fuzzy
decision learning rule based on the generalized probabilistic descent
algorithm. We apply this algorithm to a neural network equalizer with binary
sequences in a nonlinear distortion channel. Simulation results confirm that
the fuzzy decision learning algorithm works more effectively than the hard
decision learning algorithm when the learning patterns are not separable
because of high additive noise.


1.  INTRODUCTION 

In digital communication, equalization is a technique for reducing the
influence of channels with nonlinear distortion such as amplitude and delay
distortion. Applying neural-network and fuzzy logic techniques to channel
equalization leads to better performance than the linear equalizer with the
inverse-filtering formulation [1-4]. Channel equalization based on a neural
network can be viewed as a classification problem in a geometric setting,
where the equalizer is constructed as a decision-making device that
reconstructs the transmitted symbol sequence.

The decision-based learning rule is effective for a clearly separable decision
boundary. When overlapping regions occur at the decision boundary because of
noise, however, the neural network equalizer suffers from poor discrimination.
This problem is well known in pattern and speech recognition [5,6]. To solve
it, a technique based on a somewhat fuzzy decision appears to be more
suitable.

In this work, we propose a neural network equalizer with a fuzzy decision
learning rule based on the generalized probabilistic descent algorithm with the
minimum decision error formulation [6]. This learning algorithm incorporates a
penalty criterion into the conventional method with hard decision. The
penalty function treats errors with equal penalty once the magnitude of the
error exceeds a certain threshold. The neural network used for the equalizer is
the multi-layer perceptron (MLP). We apply this algorithm to a neural network
equalizer with binary sequences in a nonlinear distortion channel.


2.  NEURAL  NETWORKS  AS  CHANNEL  EQUALIZER 

The transmitted data sequence $x(t)$ is assumed to be an independent sequence
taking values from $\{-1, 1\}$ with equal probability. The channel output
$o(t)$ is corrupted by additive noise. The task of the equalizer at sampling
instant $t$ is to produce an estimate of the input symbol $x(t-n)$ using the
channel output vector $\mathbf{o}(t) = [o(t), o(t-1), \ldots, o(t-m+1)]^T$,
where the integers $m$ and $n$ are known as the order and the delay of the
equalizer, respectively. To describe a geometric formulation of the
equalization problem, define

$$P_{m,n}(1) = \{\mathbf{d}(t) \in R^m \mid x(t-n) = 1\}$$

$$P_{m,n}(-1) = \{\mathbf{d}(t) \in R^m \mid x(t-n) = -1\}$$

where $R^m$ is the m-dimensional Euclidean space, and $P_{m,n}(1)$ and
$P_{m,n}(-1)$ represent the two sets of possible noise-free channel output
vectors $\mathbf{d}(t)$ that can be produced from sequences of channel inputs
containing $x(t-n) = 1$ and $x(t-n) = -1$, respectively. It is also clear
that, if the states of $x(t), \ldots, x(t-n)$ are finite, $\mathbf{d}(t)$ can
take only finite values.

If the distribution of the noise is given, the conditional density functions
of observing the channel output vector $\mathbf{o}(t)$ given
$\mathbf{d}(t) \in P_{m,n}(1)$ and $\mathbf{d}(t) \in P_{m,n}(-1)$,
respectively, are completely specified. Given these two conditional density
functions, the equalizer can be characterized by a function $f_B$, and the
decision

$$\hat{x}(t-n) = \mathrm{sgn}\big(f_B(\mathbf{o}(t))\big)$$

achieves the minimum bit-error-ratio (BER), where

$$\mathrm{sgn}(y) = \begin{cases} 1, & y \ge 0 \\ -1, & y < 0 \end{cases} \tag{2}$$

represents a slicer.

The  set 

is  known  as  the  decision  region  of  the  optimal  BER  equalizer  and  the  decision 
boundary  of  this  equalizer  consists  of  the  set  of  points 

$$\{\mathbf{o}(t) \in R^m \mid f_B(\mathbf{o}(t)) = 0\}$$

When the decision boundary is clearly separable, the decision-based learning
algorithm solves this equalization problem effectively [1]. However, because
of noise the decision regions are often not separable.

The influence of noise on the decision boundary is now investigated. The
optimal boundaries of the NN equalizer with m = 2 and n = 0 for Gaussian white
noise with SNRs of 15 dB and 5 dB are illustrated in Fig. 1, where the region
on the left half plane of the boundary is the optimal decision region. It is
seen that noise clearly affects the decision region.


3.  LEARNING  RULE  BASED  ON  THE  FUZZY  DECISION 

Fig. 1. Channel output points and optimal decision boundary. Channel
o(t) = x(t) + 0.5x(t - 1), equalizer order m = 2 and lag n = 0;

(a) optimal decision boundary,

(b) decision boundary from hard decision under 5 dB,

(c) decision boundary from hard decision under 15 dB.

One way of providing tolerance for the nonseparable case can be derived
from a somewhat fuzzy decision [7]. The fuzzy decision associates different
degrees of error with each decision, for example marginally erroneous,
erroneous, and extremely erroneous. The technique imposes a proper penalty
function on all the 'bad' decisions as well as the 'marginally correct' ones.
The final solution represents the best compromise in terms of the total
penalty. In short, this allows a 'soft' or 'fuzzy' decision, as opposed to the
hard decision. To cope with 'marginal' training sequences, and to provide a
smooth 'gradient' for training, the penalty must be a function of the degree
of error. To derive the fuzzy decision training, we first choose an
appropriate discriminant function as


=  •  l  =  -ll 

where in the training mode $d(t) = x(t-n)$ and during data transmission
$d(t) = \hat{x}(t-n)$.

The misclassification measure Q for a desired sequence belonging to the
1-decision region is defined as

Q{f)  =  W 

The  expected  error  as  an  objective  criterion  for  weights  w  of  MLP  is  defined  as 
follows: 

$$L(w) = E[j(Q(t))] \tag{5}$$

where a loss function $j(Q(t)) = \dfrac{1}{1 + e^{-aQ(t)}}$ is defined to
evaluate the cost of the current decision and $a$ is a positive constant for
scaling.

The loss function provides a means of minimizing the number of decision errors
and induces the learning rule to emphasize states located at a short distance
from the decision boundary. The loss function then has four regions for the
fuzzy decision: (1) correct with satisfactory vigilance, (2) correct with
vigilance to be improved, (3) error on which correction will be attempted, and
(4) error to remain uncorrected, as shown in Fig. 2.


Fig.  2.  Four  regions  for  fuzzy  decision. 

The  gradient  descent  search  method  can  be  applied  to  minimize  the  expected 
error of (4). Suppose that a current desired sequence is known to belong to
decision  region  of  I;j(Q{t))  eC^  Finally,  we  can  obtain  the  fuzzy  decision 

training  rule  as 

Reinforced  training: 

Antireinforced  training: 

where $j'(Q(t))$ is the derivative of the loss function evaluated at $Q(t)$.

In (6) and (7), the weights are adjusted in proportion to the value of
$j'(Q(t))$, and the largest changes in the weights occur when $Q(t) = 0$,
which means that the decision criterion lies exactly on the boundary between
regions 1 and -1. So the more confusable the two models are, the more
reinforced training is carried out on the parameters of the correct model and
the more antireinforced training is carried out on the parameters of the
incorrect model. Consequently, the discrimination between the correct model
and the incorrect model is increased.
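
The role of the derivative $j'(Q(t))$ can be checked numerically. The sketch below assumes the sigmoid loss commonly used with the generalized probabilistic descent algorithm, $j(Q) = 1/(1 + e^{-aQ})$; this form and the function names are assumptions of the example, since the loss formula is not legible in the text.

```python
import numpy as np

def loss(Q, a):
    """Assumed sigmoid loss j(Q) = 1 / (1 + exp(-a * Q))."""
    return 1.0 / (1.0 + np.exp(-a * np.asarray(Q, dtype=float)))

def loss_derivative(Q, a):
    """Derivative j'(Q) = a * j(Q) * (1 - j(Q)) of the assumed sigmoid loss."""
    s = loss(Q, a)
    return a * s * (1.0 - s)
```

Under this assumption the derivative peaks at $Q = 0$ with value $a/4$ and decays quickly on either side, so patterns near the decision boundary drive the largest weight changes while badly misclassified or safely classified patterns contribute little. A larger scale $a$ makes this window narrower.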

3.  EXPERIMENTAL  RESULTS 

In order to evaluate the performance of the proposed method, the equalization
of a nonlinear channel with noise is studied. The channel model is

$$o(t) = d(t) + 0.2\,d^2(t) + v(t), \qquad v(t) \sim N(0, \sigma^2)$$

$$d(t) = 0.3482\,x(t) + 0.8704\,x(t-1) + 0.3482\,x(t-2)$$
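
A data generator for this channel can be sketched as follows, reading the model as $d(t) = 0.3482x(t) + 0.8704x(t-1) + 0.3482x(t-2)$ and $o(t) = d(t) + 0.2d^2(t) + v(t)$. Several details are assumptions of this example rather than statements from the paper: symbols before $t = 0$ are taken as zero, the SNR is defined relative to the power of the noise-free nonlinear output, and the function names are hypothetical.

```python
import numpy as np

def simulate_channel(num_symbols, snr_db, seed=0):
    """Generate binary symbols x, linear output d and noisy nonlinear output o."""
    rng = np.random.default_rng(seed)
    x = rng.choice([-1.0, 1.0], size=num_symbols)
    h = np.array([0.3482, 0.8704, 0.3482])
    d = np.convolve(x, h)[:num_symbols]          # d(t) = sum_k h[k] x(t-k)
    clean = d + 0.2 * d ** 2                      # memoryless nonlinearity
    noise_var = np.mean(clean ** 2) / 10.0 ** (snr_db / 10.0)  # assumed SNR definition
    o = clean + rng.normal(0.0, np.sqrt(noise_var), size=num_symbols)
    return x, d, o

def make_patterns(x, o, m=3, n=0):
    """Build equalizer inputs [o(t), ..., o(t-m+1)] and targets x(t-n)."""
    T = len(o)
    inputs = np.stack([o[t - m + 1:t + 1][::-1] for t in range(m - 1, T)])
    targets = x[m - 1 - n:T - n]
    return inputs, targets
```

The rows returned by `make_patterns` are the vectors an MLP equalizer of order `m` would see, paired with the symbol it should recover at delay `n`.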

For m = 3 and n = 0, a 3-layer neural network (3-9-5-1) with 3 inputs and
1 output is used as the equalizer. Under SNRs of 5 dB and 15 dB, Fig. 3 shows
the decision regions formed by an NN equalizer with the fuzzy decision
learning algorithm and with the conventional decision learning algorithm,
together with the optimal decision region. It can be seen that the decision
region formed by the fuzzy decision is nearer to the optimal decision region
than that formed by the conventional decision.

We compare the bit error rates (BER) achieved by the NN equalizer and by the
NN equalizer with decision feedback (DF) [8], using the proposed method and
the conventional method, for different SNRs and scales $a$ of the loss
function. The equalizers are trained on the first 1000 points. Fig. 4
illustrates the BER performance averaged over 20 runs started from different
random initial weights. It may be observed from Fig. 4 that the equalizer
using the proposed method attains about 0.5-1.0 dB improvement relative to the
equalizer using the hard decision learning rule, although the curves are
closest to each other in the low-noise situations.

Fig. 5 illustrates the BER performance of the NN equalizer with decision feedback using the fuzzy decision and the hard decision. For the NN equalizer with DF as well, the proposed method performs well in comparison with the hard decision when the level of additive noise is high.
4.  CONCLUSIONS


In this work, we propose a neural network equalizer with a fuzzy decision learning rule based on the generalized probabilistic descent algorithm with the minimum decision error formulation. It is shown that the overlap of decision regions caused by noise can be overcome by the use of a fuzzy decision learning rule based on the generalized probabilistic descent algorithm. We applied the proposed method to equalize a nonlinear communication channel with noise. Simulation results confirm that the fuzzy decision learning algorithm works more effectively than the hard decision learning algorithm when the learning patterns are not separable in high additive noise situations.
Fig. 3. Decision region under 10 dB (axis label: o(t)): (a) optimal decision boundary, (b) boundary with hard decision, (c) boundary with fuzzy decision.

Fig. 4. Comparison of BER achieved by the MLP equalizer: (a) hard decision learning rule, (b) fuzzy decision learning (α = 10), (c) fuzzy decision learning (α = 100).

Fig. 5. Comparison of BER achieved by the MLP equalizer with decision feedback (horizontal axis: signal to noise ratio (dB)): (a) hard decision learning rule, (b) fuzzy decision learning (α = 10), (c) fuzzy decision learning (α = 100).
ABSTRACT

Spatial beamforming using a known training sequence is a well-understood technique for canceling uncorrelated interferences from telecommunication signals [1]. Most on-line adaptive beamforming algorithms are based on linear algebra and linear signal models. However, nonlinearities may arise both in the transmitter amplifier and in the array receiver, producing distorted waveforms and reducing the performance of the demodulation process. A nonlinear spatial beamformer with sensor arrays may use a neural network to cope with communication system nonlinearities.

In this work we show that a feedforward neural network trained with an LS-based algorithm can converge in a time suitable for most applications.


1.  INTRODUCTION 

Signal  processing  by  sensor  arrays  is  sought  as  a  technique  for  improving  location 
parameter  estimation  (high  resolution  algorithms)  and  increasing  the  capacity  of 
telecommunication  links.  The  key  feature  of  sensor  arrays  is  that  the  number  of 
processed  sources  and  the  Signal-to-Noise-Ratio  (SNR)  are  limited  in  principle 
only  by  the  system  size  [2]. 

In  particular  an  array  is  able  to  create  a  spatial  gain  pattern  with  somewhat 
arbitrary  shape  by  properly  combining  the  outputs,  according  to  a  specified 
optimization  criterion.  Different  patterns  can  be  formed  in  parallel  from  the  same 
received  signals,  enabling  simultaneous  demodulation  and  interference 
suppression  in  a  multiple  source  environment. 

If the beam computation can be realized on-line, the capacity of the communication system can be substantially improved. This requires adaptive algorithms able to converge to the steady-state solution before the link parameters change. This is a serious problem in mobile communication systems, which may also suffer from nonlinearities along the signal path [3] and from impulsive interferences, thus requiring a proper nonlinear treatment [4].

The signal model at the array output is represented by the classical linear equation [2]:


$$x(t) = A \begin{bmatrix} s(t) \\ i(t) \end{bmatrix} + n(t) \qquad (1)$$

where x(t) is the M-by-1 sensor snapshot vector at time t, s(t) is the ideal K-by-1 (K<M) signal vector, A represents the (unknown) M-by-Q array steering matrix [2], i(t) is the (Q-K)-by-1 interference vector (K<Q<M) and n(t) is the additive background noise, uncorrelated with source and interference. All involved signals are assumed to be complex-valued (analytic signals) [1], [2].

Adaptive spatial beamforming often uses a preamble training sequence y(t) to recognize useful signals and cancel interferences by steering array nulls toward disturbances [1].

y(t) may be an analog local replica of the useful signal(s) s(t), or the sequence of modulated symbol vectors {a_i, i = 1, 2, ..., N} which forms s(t) [3]:

$$s(t) = \sum_{i=1}^{N} a_i\, u(t - iT) \qquad (2)$$

The goal is to choose the proper parameters of a function F(x(t)) of the array output which best approximates y(t). In classical beamforming the function is linear and is expressed as the Hermitian product with a weight vector w:

$$y(t) = w^{H} x(t) + e(t) \qquad (3)$$

where e(t) is the approximation error.
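For a block of training snapshots, the weight vector of the linear beamformer can be obtained as a least-squares fit of y(t) by the Hermitian product with w. The sketch below uses NumPy on synthetic, noise-free data; the sizes, the random data and the function name are assumptions made for illustration, not the setup of the paper.

```python
import numpy as np

def ls_beamformer(X, y):
    """X: (T, M) snapshots stored as rows, y: (T,) training sequence.

    Returns w such that y(t) is approximated by w^H x(t) in the LS sense.
    """
    # w^H x(t) = x(t)^T conj(w), so solve X c = y for c = conj(w).
    c, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.conj(c)

rng = np.random.default_rng(0)
T, M = 200, 4                                   # hypothetical block length and array size
X = rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))
w_true = rng.standard_normal(M) + 1j * rng.standard_normal(M)
y = X @ np.conj(w_true)                         # noise-free training sequence
w_hat = ls_beamformer(X, y)
```

With noise-free data the fit recovers `w_true` up to rounding; with noisy snapshots it returns the minimum squared error weights for that block.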

Due to channel non-stationarity and communication efficiency requirements, the adaptation process of w should be fast. The most widely known algorithms for on-line adaptation are based on stochastic gradient descent, or Least Mean Squares (LMS) [5]. However, the rate of convergence of LMS is linear [6] and is bounded by $(\rho-1)^2/(\rho+1)^2$, where $\rho$ is the condition number of the Hessian matrix [6].

As  a  matter  of  fact,  the  Hessian  matrix  in  narrowband  adaptive  beamforming 
coincides  with  the  (scaled)  spatial  cross  sensor  correlation  matrix  (CSCM)  of 
sensor  outputs.  Its  condition  number  is  of  the  same  order  of  the  array  SNR  and 
can  be  very  high  (10^^10^)  in  telecommunication  applications.  For  this  reason 
linear  beamformers  frequently  use  methods  based  on  Recursive  Least  Squares 
(RLS)  to  get  higher  rates  of  convergence  with  respect  to  classical  gradient-based 
approaches  [5]. 
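To see why a large condition number is a problem for gradient methods, the bound above can be evaluated directly. The sketch reads the bound as a per-iteration contraction factor of the error, which is the usual interpretation for steepest descent and is an assumption here; the condition numbers in the loop are hypothetical examples, not values from the paper.

```python
import math

def convergence_ratio(rho):
    """Linear-convergence bound ((rho - 1) / (rho + 1))^2 for condition number rho."""
    return ((rho - 1.0) / (rho + 1.0)) ** 2

def iterations_for_reduction(rho, factor):
    """Iterations needed for the bound to shrink the error by `factor` (0 < factor < 1)."""
    return math.ceil(math.log(factor) / math.log(convergence_ratio(rho)))

for rho in (10.0, 1e3, 1e5):   # hypothetical condition numbers
    print(rho, convergence_ratio(rho), iterations_for_reduction(rho, 1e-3))
```

The ratio approaches 1 as the condition number grows, so the number of iterations needed for a fixed error reduction grows roughly in proportion to the condition number.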


2.  NEURAL  BEAMFORMING 

In  communication  systems  the  adaptive  array  is  part  of  a  chain  of  blocks;  most  of 
them  are  intrinsically  nonlinear  or  may  exhibit  undesired  nonlinearities 
(amplifiers,  mixers,  clampers,  ...).  In  order  to  cope  with  these  nonlinearities,  a 
nonlinear beamformer should be employed. Feedforward multilayer neural networks may provide a straightforward solution to this problem [7].

The  input-output  relationship  of  the  proposed  neural  network  beamformer  with  M 
sensors  is: 


$$y(t) = F\{x(t)\} + e(t) \qquad (4)$$

In  this  case  the  correct  model  for  x(t)  is  related  to  s(t)  by  the  equation: 


$$x(t) = A\,G\left\{ \begin{bmatrix} s(t) \\ i(t) \end{bmatrix} \right\} + n(t) \qquad (5)$$


In this formula G{.} is an unknown nonlinear M-by-Q matrix transfer function, which should be inverted by the neural beamformer.

Using the L2 norm, the error functional J to be minimized is:

$$J = E\{\mathrm{trace}[\,e(t)\,e^{H}(t)\,]\} \qquad (6)$$

where E{.} denotes the expectation operator over time, trace[.] is the matrix trace operator and (.)^H indicates Hermitian transposition.

The neural beamformer realizes a nonlinear memoryless functional of x(t). A standard multilayer perceptron (MLP) can approximate arbitrary input-output relationships of this kind [7]. The inputs of the MLP are the sensor outputs, while y(t) contains the target signals.

The minimization of J may be accomplished by separating the real and the imaginary parts of all signals and using a standard backpropagation (BP) approach [7]. As an alternative, the complex neuron model can be used, which offers some benefit through a reduced number of free weights [5], [7]. Backpropagation is a stochastic gradient descent method and is characterized by the same limitations as LMS [7]. Faster second-order [6] convergence can be obtained with the Block Recursive Least Squares (BRLS) approach described in [8], [9]. In [9] BRLS is shown to be a Newton-type algorithm able to reach convergence with a very favorable numerical conditioning.
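Separating real and imaginary parts can be sketched as follows. The ordering (all real parts first, then all imaginary parts) and the helper names are assumptions; any fixed ordering works as long as inputs and targets are treated consistently.

```python
def complex_to_real(vec):
    """[z1, ..., zM] -> [Re z1, ..., Re zM, Im z1, ..., Im zM]."""
    return [z.real for z in vec] + [z.imag for z in vec]

def real_to_complex(vec):
    """Inverse of complex_to_real for a vector of even length."""
    half = len(vec) // 2
    return [complex(re, im) for re, im in zip(vec[:half], vec[half:])]

# Hypothetical snapshot of a 10-sensor array.
snapshot = [complex(0.5 * m, -0.25 * m) for m in range(10)]
mlp_input = complex_to_real(snapshot)            # real-valued MLP input
recovered = real_to_complex(mlp_input)
```

With ten complex sensor outputs this gives 20 real inputs, and two complex target signals give four real outputs, which matches the network size quoted in the experimental section.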

In  this  work  we  show  by  numerical  simulations  how  a  neural  beamformer  trained 
with  the  BRLS  technique  can  be  very  effective  in  the  presence  of  high  levels  of 
noise  and  interference,  while  detecting  and  recovering  multiple  signals  of  interest. 
Figure 1: Two-layer MLP.


3.  SUMMARY  OF  THE  BRLS  ALGORITHM 


The BRLS algorithm is an iterative learning procedure which solves an overdetermined linear system of equations for each layer of the network. With respect to similar algorithms, it minimizes the error functional at the linear summation nodes of the neurons (descent in the neuron space [9]), in contrast to the traditional optimization in the weight space.

At the n-th iteration and for the k-th layer we introduce the following matrices:

$$X_k(n) = \begin{bmatrix} x^{T}_{k,1}(n) \\ \vdots \\ x^{T}_{k,P}(n) \end{bmatrix}, \qquad Y_k(n) = \begin{bmatrix} y^{T}_{k,1}(n) \\ \vdots \\ y^{T}_{k,P}(n) \end{bmatrix} \qquad (7)$$

where $x^{T}_{k,p}(n)$ and $y^{T}_{k,p}(n)$ are the input and output row vectors of the linear section of the MLP layer, in the presence of the p-th learning pattern (P is the length of the whole batch). The lengths of $x^{T}_{k,p}(n)$ and $y^{T}_{k,p}(n)$ are respectively $(N_k+1)$ and $N_{k+1}$, $N_k$ being the number of input units to the layer; $X_1(n)$ contains the external inputs, while $X_{L+1}(n)$ contains the global outputs of the net (see Fig. 1 for the case L=2).

Matrices Y_k are computed in a forward-propagation step through the k-th layer, similar to that of standard BP [7]:

$$X_k(n)\,W_k(n) = Y_k(n) \qquad (8)$$

while the passage through the nonlinearities is represented by the following:

$$X_{k+1}(n) = [\, f(Y_k(n)) \;\; 1 \,] \qquad (9)$$

f(.) represents the activation function and 1 is the column of the bias inputs. At the first iteration the weight matrices W_k are initialized at random.
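The forward-propagation step of (8) and (9) can be sketched for one layer as follows. The tanh activation, the layer sizes and the function name are assumptions chosen for illustration; the bias column is appended last, as in (9).

```python
import numpy as np

def forward_layer(X_k, W_k, f=np.tanh):
    """Linear section (8) followed by the nonlinearity and the bias column (9)."""
    Y_k = X_k @ W_k                              # shape (P, N_{k+1})
    ones = np.ones((Y_k.shape[0], 1))            # column of bias inputs
    X_next = np.hstack([f(Y_k), ones])           # shape (P, N_{k+1} + 1)
    return Y_k, X_next

rng = np.random.default_rng(0)
P, N1, N2 = 30, 20, 8                            # batch length and layer sizes (illustrative)
X1 = np.hstack([rng.standard_normal((P, N1)), np.ones((P, 1))])   # external inputs plus bias
W1 = 0.1 * rng.standard_normal((N1 + 1, N2))     # random initial weights
Y1, X2 = forward_layer(X1, W1)
```

Calling `forward_layer` once per layer, feeding each `X_next` into the following layer, gives all the matrices X_k and Y_k needed by the learning step.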

The optimization in the neuron space is performed separately for each layer following the formula:

$$\hat{Y}_k(n) = Y_k(n) + D_k(n) \qquad (10)$$

where $\hat{Y}_k$ is an estimate of the desired Y_k based on the direction matrix D_k. D_k may be chosen in several ways, depending on the method being adopted. The simplest choice is the negative of the gradient matrix:

$$\hat{Y}_k(n) = Y_k(n) - \eta\,\nabla_{Y_k} E \qquad (11)$$

where $\eta$ is a proper correction factor. The derivatives of E w.r.t. the y's are computed using formulas similar to those of BP [7].

Given the estimate $\hat{Y}_k$, the new weight matrix for each layer is computed from the Least Squares (LS) solution of the following system:

$$X_k(n)\,W_k(n+1) = \hat{Y}_k(n) \qquad (12)$$

where in particular QR or SVD based algorithms can be used [5]. Formula (12) represents the general formulation of the class of LS-based learning algorithms; it consists in perturbing the matrix Y_k in order to recover the consistency of the system in the LS sense. This gives the new weight matrix W_k(n+1), to be used in the next forward-propagation step.
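One batch iteration of this kind can be sketched for a single output layer. This is a simplified illustration of the steps in (8), (11) and (12), not the full BRLS algorithm of [9]: it assumes a tanh activation, a squared error E, a fixed correction factor, and a plain NumPy least-squares solve instead of a recursive QR update. All names and sizes are hypothetical.

```python
import numpy as np

def ls_layer_update(X, W, T, eta=0.2):
    """Move the linear outputs along -grad E, then refit the weights by LS."""
    Y = X @ W                                    # linear outputs, as in (8)
    out = np.tanh(Y)
    grad_Y = (out - T) * (1.0 - out ** 2)        # dE/dY for E = 0.5 * ||tanh(Y) - T||^2
    Y_hat = Y - eta * grad_Y                     # desired linear outputs, as in (11)
    W_new, *_ = np.linalg.lstsq(X, Y_hat, rcond=None)   # LS solution of (12)
    return W_new

def error(X, W, T):
    """Squared output error of the layer for weights W."""
    return 0.5 * np.sum((np.tanh(X @ W) - T) ** 2)

rng = np.random.default_rng(0)
P, N_in, N_out = 50, 5, 2
X = np.hstack([rng.standard_normal((P, N_in)), np.ones((P, 1))])
T = np.tanh(X @ rng.standard_normal((N_in + 1, N_out)))   # targets from a teacher layer
W0 = 0.1 * rng.standard_normal((N_in + 1, N_out))
W1 = ls_layer_update(X, W0, T)
```

Because the system is overdetermined, the LS solve projects the desired outputs onto the column space of X, so the resulting change in the outputs is a descent direction for E whenever the projected gradient is nonzero.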

In order to stabilize learning in the earlier steps, when the weights are far from their optimal values, a recursive implementation can be adopted, updating the solution by the classical on-line RLS-QR algorithm [5], [9]. In this case the forward and backpropagation phases act on the last block of snapshots [2]. A proper exponential forgetting factor λ can be used to discard the influence of older samples [5]. More details about the BRLS algorithm can be found in [9]. Here we
point  out  that  the  numerical  robustness  of  the  algorithm  is  threatened  by  the 
severe  ill-conditioning  of  matrix  Xi  which  is  just  the  square  root  of  the  array 
CSCM.  However  the  use  of  a  square  root  formulation  keeps  the  condition  number 
acceptably  low  (lO'^^lO^),  allowing  the  use  of  the  limited  precision  floating-point 
arithmetic  offered  by  commercial  DSP  microprocessors. 
4.  EXPERIMENTAL  TRIALS


In  the  proposed  computer  experiment  a  linear  equispaced  array  of  ten  sensors, 
with  an  intersensor  spacing  of  half  wavelength,  is  used  as  receiver.  The  useful 
signals  are  two  independent  8-QAM  waveforms  with  a  SNR  of  10  dB  w.r.t.  each 
sensor,  impinging  from  8  and  25  degrees  of  azimuth,  referred  to  the  array 
broadside;  the  elevation  is  zero  degrees  for  both  sources.  The  signals  of  interest 
are  distorted  by  a  memoryless  arctangent  nonlinearity,  which  models  the  amplifier 
static  transfer  function.  The  interference  is  represented  by  a  white  Gaussian  noise 
source,  coming  from  a  direction  of  15  degrees,  with  a  SNR  of  20  dB.  The  array 
receiver  gains  fluctuate  with  a  standard  deviation  of  1%  of  their  nominal  values 
during  the  experiment.  The  background  noise  is  supposed  to  be  Gaussian,  white 
and  isotropic  [2].  The  conditions  of  the  experiment  are  recognized  to  be  rather 
unfavorable  since  all  coherent  sources  are  within  one  array  beamwidth  [2],  [3]. 
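The array geometry of this experiment can be sketched with the usual narrowband steering vector of a uniform linear array. The sketch generates a single snapshot of the linear model (1) only; it omits the arctangent distortion, the gain fluctuations and the QAM modulation, and the sign convention, noise level and source amplitudes are assumptions.

```python
import numpy as np

def steering_vector(theta_deg, n_sensors=10, spacing_wavelengths=0.5):
    """Narrowband ULA response for a plane wave arriving theta_deg from broadside."""
    m = np.arange(n_sensors)
    phase = 2.0 * np.pi * spacing_wavelengths * m * np.sin(np.deg2rad(theta_deg))
    return np.exp(1j * phase)

angles = [8.0, 25.0, 15.0]       # two useful sources and the interferer (from the text)
A = np.column_stack([steering_vector(a) for a in angles])   # M-by-Q, here 10-by-3

rng = np.random.default_rng(0)
sources = rng.standard_normal(3) + 1j * rng.standard_normal(3)      # placeholder [s(t); i(t)]
noise = 0.1 * (rng.standard_normal(10) + 1j * rng.standard_normal(10))
x = A @ sources + noise          # one snapshot of the linear model (1)
```

The three columns of `A` are close to each other because all directions lie within a few degrees, which is one way to see why the resulting problem is ill-conditioned.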

The  neural  network  used  in  the  experiment  is  a  multilayer  perceptron  with  20 
inputs,  8  hidden  neurons  and  four  outputs,  with  sigmoidal-type  nonlinearities. 
Learning was performed with the BRLS algorithm on 100 epochs of 30 snapshots each. Several values for the forgetting factor λ were tried; in the described experiment λ=0.99 was used. Figure 2 shows the curve of the ratio of the target signal power (Mean Squared Signal, MSS) to the error power (Mean Squared Error, MSE) for each source of interest during learning. The steady-state solution is reached after about 35 epochs.

The  almost  monotonic  shape  of  the  learning  curves  demonstrates  the  ability  of  the 
BRLS  algorithm  to  deal  with  ill-conditioned  problems  and  to  track  the  short  time 
channel  fluctuations  that  can  be  expected  in  real  systems.  Also  remarkable  is  the 
insensitivity  of  the  BRLS  method  to  the  starting  guess  of  the  network  weights  [9], 
which  is  essential  for  successful  signal  processing  applications. 


CONCLUSION 

The recently introduced BRLS algorithm for fast training of MLP networks allows
the  use  of  neural  architectures  in  challenging  multichannel  DSP  problems, 
characterized  by  severe  ill-conditioning  of  the  data  matrix  coupled  with  stringent 
requirements  on  convergence  rate.  The  general  approach  described  in  [9]  has  a 
great  flexibility  in  changes  of  the  error  functional  and  learning  parameters,  and 
may  introduce  several  forms  of  weight  regularization  through  system  (12)  [5].  We 
plan  to  apply  the  BRLS  neural  approach  to  combine  space-time  equalization  of 
communication  channels  and  nonlinear  distortion  correction. 
Figure 2: Learning curves for the two complex outputs.

ABSTRACT

Methods which combine the outputs of multiple pattern classifiers to enhance the overall classification rate are studied. Specific attention is given to combination rules which are independent of the input feature vectors. The potential performance enhancement and the limits of this so-called stack generalization method are discussed. In particular, a phenomenon called "alias" is introduced which gives an upper bound on the performance which can be achieved using stack generalization for a given set of member classifiers. Experiments using several machine learning databases are reported.


I.  INTRODUCTION

Pattern  classification  is  the  enabling  technology  for  speech 
recognition,  image  understanding,  target  recognition,  and  other 
important  signal  processing  applications.  A  pattern  classifier  is  a 
decision-making  algorithm  which  determines  the  class  label  of  a 
feature  vector  presented  to  the  classifier.  Based  on  statistical  decision 
theory,  artificial  intelligence,  fuzzy  logic  theory,  and  many  other 
approaches,  numerous  types  of  pattern  classifiers  have  been 
developed [3]. However, it remains an open question which pattern classifier to use for a particular problem at hand. It is generally accepted that a universal pattern classifier which will outperform every other pattern classifier is unlikely to be found.

A practice which is gaining popularity is to combine the outputs of several pattern classifiers using, say, the majority voting combination method. The situation is analogous to the decision-making process in human society, where many experts, each specializing




in a sub-field, are often summoned to form a committee to solve a complicated problem in a collective manner. The belief is that collective efforts can often arrive at a decision superior to that of any individual expert. A pattern classifier formed by a combination of several member classifiers is called a committee classifier in this study. A number of empirical studies of combining multiple classifiers have been reported in the literature [1], [2], [4], [5], [6], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18]. We note there is another family of algorithms for combining multiple classifiers, called the mixture of experts (MoE) approach [7], [8], [19]. In the MoE approach, the outputs of the member classifiers are linearly combined with weights which are functions of the input feature vectors through the use of a "gating network". In this study, we distinguish committee classifiers from mixtures of experts by restricting the combination rules of a committee classifier to be dependent on the outputs of the member classifiers only, and not directly dependent on the inputs to each member classifier.

Many of these works reported that the enhancement of the classification rate of a committee classifier is maximized when its member classifiers are independent of each other. This is often shown assuming that the committee classifier's output is an ensemble average (linear combination) of those of its member classifiers. Such an analysis is more suitable when the member classifier's output is interpreted as an estimate of the posterior probability that a given feature vector belongs to a specific class. For those classifiers whose output is binary valued, such an analysis is not quite applicable. In [5], the voting mechanism of member classifiers is analyzed assuming the output of each classifier obeys a binomial distribution. Some asymptotic behavior of such a majority committee classifier has been given. In general, majority voting is a simple, yet effective combination method.
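Majority voting over the class labels produced by the member classifiers can be sketched in a few lines. The tie-breaking rule (the label seen first wins) is an assumption, since the text does not specify one.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label; ties go to the label that appears first."""
    counts = Counter(labels)
    best = max(counts.values())
    for label in labels:
        if counts[label] == best:
            return label

votes = [1, 2, 2]                 # class labels from three member classifiers
decision = majority_vote(votes)
```

With an odd number of two-class member classifiers no tie can occur, so the tie-breaking rule only matters for even committees or more than two classes.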

Wolpert [15], [16] used the term "stack generalization" to describe a general committee classifier whose output may be a multi-level nonlinear combination of lower-level member classifiers. Voting can be regarded as a special case of stack generalizers. In this study, we analyze the best performance a nonlinear committee classifier can achieve given that each member classifier gives only binary output and is fixed (i.e. cannot be modified or trained). We identify a phenomenon called "aliasing", which corresponds to the situation in which feature vectors with different class labels have the same output combination from all member classifiers. When an alias occurs, the combined classifier will never reach perfect classification.

In Section II, some basic notions of a committee classifier will be reviewed briefly. In Section III, the alias phenomenon will be analyzed. Some simulation results will be reported as well. In the following discussion, we denote by x the current feature vector presented to a classifier, and by y(i) the output of the i-th expert classifier. The output of the combined committee classifier is denoted by z.


II.  LINEAR  COMMITTEE  CLASSIFIER

A committee classifier consists of a committee of n individual pattern classifiers. Their outputs, denoted by {y(i); 1 ≤ i ≤ n}, are to be combined, linearly or non-linearly, via a set of combination rules, to form the final output, z. For a classification problem, the outputs y(i) and z are c-by-1 vectors with a "1" in an entry indicating that the classifier decides that the input feature vector x belongs to the corresponding class. Usually, one would allow only one element to be 1 and the rest should remain at 0.

A linear committee classifier approach is based on the assumption that each classifier's output is a real number between 0 and 1 and can be interpreted as an estimate of the posterior probability P{i|x} that x is drawn from class i given its value. A model of the classifier's output can be written as y(x,i) = P{i|x} + ε(i|x), where ε(i|x) is a random estimation error with zero mean and variance $\sigma_i^2(x)$. Then the objective is to find a set of weights {w(i); 1 ≤ i ≤ n} such that the variance of the overall linear estimate

$$\left\| z(x) - \sum_{i=1}^{n} y(x,i)\, w(x,i) \right\|^{2} \qquad (1)$$

is minimized. The minimum variance estimate is obtained with the weights

$$w(x,i) = \frac{1}{\sigma_i^2(x) \sum_{j=1}^{n} \left[ 1/\sigma_j^2(x) \right]} \qquad (2)$$

In other words, the weights are inversely proportional to the variance of the estimate. To apply this method, one must estimate the error variance $\sigma_i^2(x)$ for each expert classifier.
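The inverse-variance weighting of (2) can be sketched as follows. The variances and member outputs below are hypothetical numbers chosen for illustration; in practice the error variances would have to be estimated, for example on held-out data.

```python
def inverse_variance_weights(variances):
    """Weights proportional to 1 / variance, normalized to sum to one, as in (2)."""
    inv = [1.0 / v for v in variances]
    total = sum(inv)
    return [g / total for g in inv]

def combine(estimates, variances):
    """Weighted linear combination of the member outputs."""
    weights = inverse_variance_weights(variances)
    return sum(w * y for w, y in zip(weights, estimates))

variances = [0.01, 0.04, 0.04]            # hypothetical error variances of three members
weights = inverse_variance_weights(variances)
z = combine([0.9, 0.7, 0.8], variances)   # combined posterior estimate
```

The member with the smallest error variance receives the largest weight, and equal variances reduce the rule to a plain average.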


III.  NONLINEAR  COMBINATION  RULES

Nonlinear combination rules can be regarded as a general meta-classifier designed to classify a concatenated feature vector y(x) = [y(x,1) y(x,2) ... y(x,n)]. Any known classifier structure, such as MAP (maximum a posteriori probability), kNN (k nearest neighbors), SOM (self-organizing map), or decision trees (e.g. ID3), can be applied to serve this purpose. The question is: is there any way to predict how the committee classifier performs compared to the individual member classifiers?
Figure 1. Examples illustrating the aliasing effect of a committee classifier

Let us examine a special case, illustrated in Figure 1, in which both the inputs (feature vectors) and the outputs of each classifier take discrete values in {0, 1}. In this case, there are only 2^k different input combinations, where k is the feature vector dimension. In other words, we have a Boolean function realization problem on hand. Let us focus on a single class at a time. We now have a truth table similar to the one shown in Figure 1 (shaded cells indicate misclassification by the corresponding classifier). In this figure, y is the desired mapping and y1-y4 are four imperfect member classifiers. The question now is: given y1, y2, y3, y4, can a meta-classifier (combination rule) be defined so that it gives an output which is the same as y? This is an incompletely specified four-variable Boolean minimization problem, and one of the solutions is y = y2*y4 + y1*y4 + y1*y2. In other words, by combining y1, y2, and y4, the committee classifier is able to yield a 100% classification rate on this training data set, a performance better than that of any individual classifier. On the other hand, if only y1, y2, and y3 are available, note that the combinations (y1,y2,y3) = 010 and 101 each appear twice with different values of y. Thus, one can choose only one of the values for each combination. This implies that the maximum classification rate will be at most 6/8, which is no better than either y1 or y2 alone. This phenomenon of having different target values associated with the same classifier output combination is called alias.
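Aliases can be detected mechanically by grouping the samples by their member-output pattern. The sketch below uses a small hypothetical data set (not the truth table of Figure 1) and returns both the aliased patterns and the best classification rate any rule based only on the member outputs could reach on that data.

```python
from collections import defaultdict

def alias_bound(member_outputs, targets):
    """Best achievable accuracy for any rule that sees only the member outputs."""
    groups = defaultdict(lambda: defaultdict(int))
    for pattern, y in zip(member_outputs, targets):
        groups[tuple(pattern)][y] += 1           # count targets per output pattern
    # For each pattern, at best the most frequent target is predicted.
    correct = sum(max(c.values()) for c in groups.values())
    aliases = [p for p, c in groups.items() if len(c) > 1]
    return correct / len(targets), aliases

# Hypothetical data: the pattern (0, 1) occurs with both target values.
outputs = [(0, 0), (0, 1), (0, 1), (1, 0), (1, 1), (1, 1)]
targets = [0, 0, 1, 1, 1, 1]
best_rate, aliased = alias_bound(outputs, targets)
```

Here one of the two samples with pattern (0, 1) must be misclassified whatever the combination rule, so the bound is below 100%.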

When y(x,i) ∈ {0, 1}, the feature space X is decomposed into disjoint regions labeled by the derived feature vectors (y1, y2, ...). Alias can be regarded as the misclassification errors within each of these regions. Therefore, an optimal combination rule would be to assign to each of these regions the class label which minimizes the alias error. As a matter of fact, in this case, the optimal committee classifier amounts to a look-up table which maps each binary vector to a specific class label.


IV.  EXPERIMENTATION

We have employed four data sets from the machine learning database at UCI. They are: A. credit card applications, B. breast cancer diagnosis, C. DNA promoter sequence recognition, and D. poisonous mushroom identification. Each data file is randomly partitioned into three parts. A three-way cross-validation procedure is adopted to better estimate the generalization error: each method is applied three times (trials) to each data file. In each trial, two of the three parts are used as training data, and the third as the testing data. After three trials, each of the three parts of the original data file will have been tested exactly once. The testing error rates of the three trials are then averaged to yield the overall classification error rate of a particular classification method on a given data set.
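The three-way cross-validation procedure can be sketched as follows. The shuffling seed, the interleaved assignment of samples to parts and the `error_fn` callback are assumptions introduced for illustration; `error_fn` stands for training a classifier on the training indices and returning its error rate on the test indices.

```python
import random

def three_way_splits(n_samples, seed=0):
    """Yield (train, test) index lists; each part is used for testing exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)             # random partition of the data file
    parts = [idx[i::3] for i in range(3)]
    for k in range(3):
        test = parts[k]
        train = [i for j, part in enumerate(parts) if j != k for i in part]
        yield train, test

def cross_validated_error(n_samples, error_fn, seed=0):
    """Average the test error of error_fn(train, test) over the three trials."""
    errors = [error_fn(train, test) for train, test in three_way_splits(n_samples, seed)]
    return sum(errors) / len(errors)

splits = list(three_way_splits(10))
```

Every sample appears in exactly one test part and in the training data of the other two trials.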
Four expert classifiers are used in this experiment: a 3-nearest neighbor (3NN) classifier, a maximum likelihood (ML) classifier, a learning vector quantization (LVQ 2.1) classifier, and a multi-layer perceptron (MLP) classifier. In developing the ML classifier, the feature vectors in each class are assumed to have a normal distribution. Thus, it effects a linear classifier. For the MLP classifier, a 2-layer, fully connected configuration (one hidden layer) is used, with 10 hidden units, a number assigned arbitrarily. Since our objective here is not to compare the performance of individual classifiers, sub-optimal implementation of these classifiers should not prevent us from comparing the committee classifier with the best of the individual classifiers. Each of these four classifiers will be used as the committee classifier, classifying not only the outputs of the member classifiers but also the original feature vector to facilitate alias-free classification. All but the LVQ algorithm are implemented with Matlab (v.4.2c) m-files, tested on an HP workstation. The LVQ algorithm is implemented by the SOM research group of the University of Helsinki, and is available at ftp://cochlea.hut.fi/pub/.

Note that for each data set and each classification method, there are actually three different expert classifiers developed, each on one of the three different training data sets. Thus, all experiments are performed three times on these three different partitions, and the results are reported as the average of three trials. The distribution of the member classifier outputs, which form the inputs to the committee classifier, is summarized in Tables 1 to 3 below. The order of the outputs is ML-3NN-LVQ.


Output of    Cancer 1            Cancer 2            Cancer 3
             Class 1   Class 2   Class 1   Class 2   Class 1   Class 2
111          109       2         110       1         106       1
112          0         0         1         0         0         3
121          0         6         0         2         0         5
211          0         0         0         0         0         0
222          0         11        0         6         0         8
221          0         0         0         0         0         0
212          0         0         0         0         0         0
122          0         46        4         50        2         49

Table 1. Cancer data set output.
Output of    Card 1              Card 2              Card 3
             Class 1   Class 2   Class 1   Class 2   Class 1   Class 2
111          60        6         69        8         58        8
112          0         0         2         0         1         0
121          3         2         2         0         3         0
211          12        5         7         5         14        12
222          6         57        11        55        7         48
221          5         4         2         1         3         6
212          5         6         5         3         7         5
122          1         0         1         1         0         0

Table 2. Card data set output.


Output of    Heart 1             Heart 2             Heart 3
             Class 1   Class 2   Class 1   Class 2   Class 1   Class 2
111          73        12        73        12        73        13
112          2         3         3         7         0         5
121          4         6         1         2         7         3
211          4         1         3         7         3         5
222          14        58        13        67        17        45
221          4         1         5         9         2         0
212          4         5         0         1         4         3
122          5         34        3         24        5         45

Table 3. Heart data set output.
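The counts in Table 3 allow a direct check of how much aliasing limits any combination rule. The sketch below takes the Heart 1 column as transcribed from the table and compares majority voting with a look-up table that assigns each output pattern to its more frequent class. The resulting rates are computed on these counts only, so they are not directly comparable with cross-validated error rates; the function names are illustrative.

```python
# Counts (class 1, class 2) per ML-3NN-LVQ output pattern, Heart 1 column of Table 3.
heart1 = {
    "111": (73, 12), "112": (2, 3), "121": (4, 6), "211": (4, 1),
    "222": (14, 58), "221": (4, 1), "212": (4, 5), "122": (5, 34),
}

def voting_errors(counts):
    """Errors of majority voting: predict class 2 when two or more outputs say 2."""
    errors = 0
    for pattern, (n1, n2) in counts.items():
        predicts_2 = pattern.count("2") >= 2
        errors += n1 if predicts_2 else n2
    return errors

def lookup_table_errors(counts):
    """Errors of the best look-up table: the minority class of each pattern is lost."""
    return sum(min(n1, n2) for n1, n2 in counts.values())

total = sum(n1 + n2 for n1, n2 in heart1.values())
vote_rate = voting_errors(heart1) / total
optimal_rate = lookup_table_errors(heart1) / total
```

The look-up table can never do worse than voting on the same counts, and the errors it still makes are exactly the aliased samples.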

The shaded areas in these three tables mark the output combinations in which two or three outputs indicate class 2 and which, according to the majority voting rule, should be classified as class 2. We see that aliases do occur. For example, in Table 3, under the column Heart 1, while all three classifiers' outputs indicate class 1, there are 12 samples that actually belong to class 2. Therefore, no matter how smart the committee machine is, it will be unable to identify these 12 samples as class 2. This is verified in the following experiment: we construct an induced feature vector which consists of the outputs of each of the four classifiers. Then we develop a committee classifier to classify these extended feature vectors using each of the four types of classifiers (3-NN, ML, LVQ, MLP). Again, three trials are performed on each of the three different partitions of each data set, and the results are reported below:


          Voting    ML        3-NN      LVQ 3.0   Optimal
Cancer    4.98%     2.49%     3.64%     4.02%     2.11%
Card      19.18%    19.18%    19.77%    19.18%    18.41%
Gene      21.98%    26.86%    22.66%    14.88%    13.79%
Heart     22.03%    22.09%    22.90%    22.03%    20.00%

Table 4. Classification error rates of committee classifiers with
outputs of member classifiers only

From this table, we observe that the committee combination at times
outperforms majority voting significantly. However, some combination
methods, notably the 3-NN method, consistently perform worse than
simple voting. This is because there are only 8 distinct input data
samples, and 3-NN is inadequate for classification when the number of
distinct data samples is too small. The entries in the column labelled
"Optimal" are the minimum classification errors that can be achieved by
the committee classifier given the component classifiers; they form a
lower bound. The errors in this column are due entirely to the aliasing
effect discussed earlier.
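
The relation between majority voting and the aliasing lower bound can be
illustrated with the counts of a single partition. The following is a
minimal Python sketch, not the authors' code. It uses the Heart 1 column
of table 3; the dictionary layout and the function names are
illustrative assumptions. Because it covers one partition and the three
classifiers of table 3 only, its error rates need not coincide with the
averaged figures of table 4.

```python
# Heart 1 column of table 3: output pattern of (ML, 3-NN, LVQ) mapped to
# (number of true class-1 samples, number of true class-2 samples).
HEART_1 = {
    "111": (73, 12), "112": (2, 3), "121": (4, 6), "211": (4, 1),
    "222": (14, 58), "221": (4, 1), "212": (4, 5), "122": (5, 34),
}


def majority_vote_errors(table):
    """Count samples misclassified by the majority voting rule."""
    errors = 0
    for pattern, (n_class1, n_class2) in table.items():
        votes_for_class2 = pattern.count("2")
        # With three voters, two or more votes decide the class.
        errors += n_class1 if votes_for_class2 >= 2 else n_class2
    return errors


def alias_lower_bound(table):
    """Fewest errors any committee can make when it sees only the outputs.

    Samples sharing one output pattern are indistinguishable (aliases),
    so the best possible rule assigns each pattern to its larger class
    and still misses the smaller one.
    """
    return sum(min(counts) for counts in table.values())


total = sum(n1 + n2 for n1, n2 in HEART_1.values())
voting_rate = majority_vote_errors(HEART_1) / total
optimal_rate = alias_lower_bound(HEART_1) / total
```

The 12 class-2 samples hidden behind pattern 111 contribute to both
counts, which is why no combination rule can remove them.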

From the above results, we observe that, compared to simple majority
voting, the committee machine approach, at least in this experiment,
does not significantly improve the classification performance in
general. Among the three different classifiers, LVQ 3.0 seems to
outperform the voting method consistently, while the other two
classifiers give mixed results. Comparing the committee method with the
extended committee method, where the original features are used, the
results are mixed. Our preliminary explanation is that the additional
dimensions confuse the ML or 3-NN based committee classifier.
Abstract

An algorithm for unsupervised speaker classification using Kohonen SOM
is presented. The system employs a 6x10 SOM network for each speaker and
for non-speech segments. The algorithm was evaluated using high quality as
well as telephone quality conversations between two speakers. Correct
classification of more than 90% was demonstrated. High quality conversation
between three speakers yielded 80% correct classification. The high quality
speech required the use of a vector of 12th-order cepstral coefficients. In
telephone quality speech, 12 additional features of the difference of the
cepstrum were required.


INTRODUCTION 

Speaker  recognition  (identification  and  verification)  is  being  used  in  many 
commercial,  military  and  forensic  applications.  Usually  the  problem  is  defined  as 
supervised  classification,  where  a-priori  knowledge  on  the  speakers  is  available  so 
that  pre-training  can  be  performed  [1-4].  In  many  applications,  however,  no  such 
a-priori  knowledge  is  available.  Unsupervised  methods  must  be  used. 

Solutions to various aspects of the problem have been suggested in the
literature. The application of hierarchical NN was described in [5], and HMM
based systems in [6-9]. Other methods, based on the EM algorithm for Gaussian
mixture estimation [10] and on various VQ methods [11-13], were also employed.

In general, given a multi-speaker conversation, the algorithm has to estimate
the number of speakers, segment the speech signal and assign each segment
to its speaker. The problem has also been termed “speech segmentation” [10-11].
In our current application the number of speakers, R, is assumed to be known.
Generally, during a conversation, it may happen that one speaker interferes with
another. We assume that the speech signal does not contain such interference,
namely that simultaneous speech does not occur. All segments with simultaneous
speech are currently eliminated manually from the data prior to performance
evaluation (during the training process all the data was used).

We suggest here an unsupervised classification system that first makes a
preliminary segmentation into speech/non-speech segments, using only an “energy”
threshold. The system then automatically trains R+1 Kohonen SOMs [14]: R for the
speakers and one for non-speech segments. Initial conditions are set, and then all
neural networks (NN) compete among themselves until a balance is achieved.

There were four reasons why Kohonen’s SOM was chosen. First, an
unsupervised learning algorithm was required because of the problem definition.
Second, due to the short segments, multiple centroids are required to describe
each speaker. Third, when we use SOMs, every SOM defines a different speaker
model (or non-speech model). If we used one large network, it would be
impossible to indicate which centroids (or neurons) belong to the same model.
For this reason other unsupervised networks such as ART2 [15], or the network
architecture proposed by Nissani [16], cannot be used. Fourth, every SOM is a
trained code book (CB); this means that the SOMs can be used as CBs for discrete
HMMs that can later be used for (supervised) speaker recognition.


SYSTEM’S  ARCHITECTURE 

The general block diagram of the system is shown in figure 1.

The speech analysis was based on overlapping 15-millisecond analysis frames,
with one frame every 5 milliseconds. Each frame was represented by a feature
vector which included the 12th-order cepstral coefficients, estimated from the
12th-order LPC. In the telephone data, the feature vector was augmented by the
12th-order first-difference cepstral coefficients [1]. In addition, the mean
absolute value of an accumulated 50-millisecond frame was also calculated, for
speech/non-speech evaluation.

Rough segmentation of speech and non-speech data was performed by
thresholding the absolute value feature. The threshold level was set at three
percent of the maximum for high quality speech and at one percent for telephone
data. The levels were determined experimentally. The fact that a higher level
was required for high quality speech seems illogical. It is probably due to the
fact that in the high quality data the variance of the speech amplitude is much
lower than that of the telephone speech. The use of more sophisticated speech
detection algorithms should be explored here.
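
The rough speech/non-speech decision described above can be sketched as
follows. This is a minimal Python illustration, not the authors'
implementation: the function names, the use of non-overlapping 50 ms
frames, and taking "the maximum" to mean the largest frame value in the
recording are all assumptions.

```python
def mean_abs_frames(samples, frame_len):
    """Mean absolute value of each non-overlapping frame of frame_len samples."""
    return [
        sum(abs(s) for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def rough_speech_mask(samples, sample_rate, frame_ms=50, rel_threshold=0.03):
    """Label each frame True (speech) or False (non-speech).

    rel_threshold is the fraction of the maximum frame value used as the
    threshold: 0.03 for high quality speech, 0.01 for telephone data.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    feature = mean_abs_frames(samples, frame_len)
    threshold = rel_threshold * max(feature)
    return [value > threshold for value in feature]


# Three 50 ms frames at 8 kHz: faint noise, loud speech, silence.
signal = [0.001] * 400 + [1.0] * 400 + [0.0] * 400
mask = rough_speech_mask(signal, sample_rate=8000)
```

Frames labelled False would seed the non-speech model, and the remaining
frames would be divided among the R speaker models.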

The  initial  conditions  to  the  system  were  determined  as  follows:  all  segments, 
classified  by  the  rough  speech/non-speech  classifier  as  non-speech,  were  used  to 
train  the  non-speech  network.  Segments  roughly  classified  as  speech  segments 
were  randomly  and  equally  divided  and  used  to  train  the  R  speaker  models. 

Each one of the models (including the non-speech model) was a Kohonen 6x10
SOM. Each SOM was trained by the Kohonen algorithm [14]. The inputs to the
SOM were the cepstrum, or the cepstrum and difference cepstral coefficients. The
outputs of the SOM were the Euclidean distances between the input vector and the
network’s weight vectors. In each iteration, at the end of the training process,
a regrouping process was employed. The regrouping was performed on segments of
100 frames (0.5 second).

The  algorithm  is  based  on  clustering  the  data  in  such  a  way  that  a  total  error 
criterion,  during  regrouping,  is  minimized. 

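Let $d_{k,n}^{(m)}(r)$ be the Euclidean distance between the n-th vector of the
k-th segment ($S_k$) and the closest centroid in the r-th model, during
iteration m:

$$d_{k,n}^{(m)}(r) = \min_{c} \left\| x_{k,n} - w_{r,c}^{(m)} \right\| \tag{1}$$

where $x_{k,n}$ denotes that vector and $w_{r,c}^{(m)}$ the c-th centroid of the
r-th model.

In the m-th iteration, the total distance between the k-th segment and the r-th
model, $D_k^{(m)}(r)$, is given in (2).

$$D_k^{(m)}(r) = \sum_{n=1}^{100} d_{k,n}^{(m)}(r) \tag{2}$$

The k-th segment, $S_k$, is assigned to the model j ($SOM_j$) yielding minimum
total error:

$$j = \arg\min_{r} \left\{ D_k^{(m)}(r) \right\} \;\Rightarrow\; S_k \in SOM_j \tag{3}$$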

Hence  an  iteration  of  the  process  is  defined  by: 

1. Retrain the models with the new clusters obtained in the previous iteration.

2. Regroup the data using (3).

3. Check for termination: if the termination criterion is met, exit; if not,
return to step 1.

It  has  been  proved  that  this  algorithm  converges  [17]. 
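
The retrain/regroup iteration can be sketched as below. This is a
minimal Python illustration under stated assumptions, not the authors'
system: each model is a plain list of centroids trained by a few k-means
(Lloyd) sweeps, which stands in for Kohonen SOM training, and all
function and parameter names are hypothetical.

```python
import math
import random


def segment_distance(segment, codebook):
    """Total distance (2): per-frame distance to the nearest centroid, summed."""
    return sum(min(math.dist(v, c) for c in codebook) for v in segment)


def train_codebook(vectors, size, rng, sweeps=5):
    """Stand-in for SOM training (assumption): a few k-means sweeps."""
    codebook = rng.sample(vectors, min(size, len(vectors)))
    for _ in range(sweeps):
        buckets = [[] for _ in codebook]
        for v in vectors:
            nearest = min(range(len(codebook)),
                          key=lambda i: math.dist(v, codebook[i]))
            buckets[nearest].append(v)
        # Move each centroid to the mean of its bucket; keep it if empty.
        codebook = [
            [sum(col) / len(b) for col in zip(*b)] if b else c
            for b, c in zip(buckets, codebook)
        ]
    return codebook


def cluster_segments(segments, labels, n_models, size=4, max_iter=100, seed=0):
    """Iterate retraining and regrouping until the clusters stop changing."""
    rng = random.Random(seed)
    labels = list(labels)
    for _ in range(max_iter):
        # Step 1: retrain each model on the segments currently assigned to it.
        codebooks = {}
        for r in range(n_models):
            pooled = [v for seg, lab in zip(segments, labels)
                      if lab == r for v in seg]
            if pooled:
                codebooks[r] = train_codebook(pooled, size, rng)
        # Step 2: regroup, assigning each segment to its closest model (3).
        new_labels = [
            min(codebooks, key=lambda r: segment_distance(seg, codebooks[r]))
            for seg in segments
        ]
        # Step 3: stop when two consecutive iterations agree.
        if new_labels == labels:
            break
        labels = new_labels
    return labels
```

A call such as `cluster_segments(segments, initial_labels, n_models=R + 1)`
returns one model index per segment; the initial labels would come from
the rough speech/non-speech split and the random division of the speech
segments.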

At  the  end  of  this  iterative  procedure,  the  system  provides  R+1  models,  for  the 
R  speakers  and  for  non-speech  data.  The  data  is  segmented  and  labeled  as  required. 

The termination criterion used here was based on the regrouping. Termination
was declared when two consecutive iterations showed no change in the clusters. It
is of course possible to use a less restrictive criterion, which requires only
that two consecutive iterations exhibit a change of no more than a given
predetermined level. The use of such a criterion will reduce computation time at
the expense of accuracy.


CLASSIFICATION  ERROR  EVALUATION 


The algorithm is based on the classification of 0.5 second segments. Each
segment may be assigned to one model (speaker or non-speech model) or, in
transient segments, due to the finite resolution, may be common to two or more
models. The definition of the classification error is clear in the non-transient
segments. In the case of transient segments, the correct assignment may be to
either one of the correct models. Obviously, it makes sense to define a
classification error that takes into account a segment split between models. A
piecewise linear classification error weight is used here.

Figure 2 shows 10 seconds (200 frames per second) of manually classified
speech and the error weighting. The dashed lines show an example where a
segment includes speech from both the first and the second speakers.

The error weighting has been defined as follows:

1. From the manual segmentation of the speech, all transient times, namely the
switching times between speakers, were found and denoted $n_m$,
$m = 1, \ldots, M$.

2. In the neighborhood of every transient time a local error weighting function,
$w_m(n)$, was defined as:

$$w_m(n) = \begin{cases} \dfrac{|n - n_m|}{L/2}, & |n - n_m| < \dfrac{L}{2} \\ 1, & \text{otherwise} \end{cases} \tag{4}$$

where L is the segment’s duration (L = 500 msec in our case).

3. Sum all the local weighting functions and subtract $(M - 1)$:

$$g(n) = \sum_{m=1}^{M} w_m(n) - (M - 1) \tag{5}$$

4. The general weighting function will be:

$$H(n) = \begin{cases} g(n), & g(n) > 0 \\ 0, & \text{otherwise} \end{cases} \tag{6}$$
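
The four steps above can be sketched in Python as follows. This is only
an illustration under an explicit assumption about equation (4): the
local weight is taken to rise linearly from 0 at a transient time to 1
at a distance of L/2 from it, and to stay at 1 elsewhere. The function
names are hypothetical, and times are expressed in frames.

```python
def local_weight(n, transient, seg_len):
    """Local weight around one transient time (assumed form of (4))."""
    half = seg_len / 2
    distance = abs(n - transient)
    return distance / half if distance < half else 1.0


def error_weight(n, transients, seg_len):
    """General weighting H(n): summed local weights minus (M - 1), clipped at 0."""
    m = len(transients)
    g = sum(local_weight(n, t, seg_len) for t in transients) - (m - 1)
    return g if g > 0 else 0.0


# One speaker change at frame 100, segments of 100 frames (0.5 second).
weights = [error_weight(n, [100], 100) for n in (50, 75, 100, 125, 150)]
```

Far from any speaker change every local weight equals 1, so the sum
minus (M - 1) is 1 and an error counts fully; close to a change the
weight drops, and the clipping keeps it non-negative when two changes
fall within one segment length of each other.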


Fig  2:  Error  weighting  function  (“0”-non-speech,  “1”-  speaker  A,  “2”-  speaker  B). 
a)  10  seconds  of  manual  segmentation,  b)  Weighted  error  function. 


THE  DATA  BASE 

The Hebrew data base, used for the evaluation of the system, consisted of 9
files with two speakers and 3 files with three speakers, all of high quality
speech dialogue, and 12 telephone dialogues. The durations of the high quality
speech files were 72-180 seconds per file. The telephone files were about two
minutes long each. The high quality speech (7.8 kHz bandwidth) was sampled in an
acoustic room, at a 16 kHz sampling rate and 12-bit resolution. Five males and
one female participated in the conversations. One of the speakers took part in
all dialogues. The telephone quality dialogues were recorded from the telephone
line. The data was filtered by a 3.8 kHz low pass filter and sampled at an 8 kHz
sampling rate, with 12-bit resolution. Twenty-four male speakers participated in
the dialogues.


RESULTS 

The algorithm was evaluated with the small data base described above. All high
quality conversations between two speakers were tested on non-overlapping
segments of half a second. It was found that most of the errors between
automatic and manual segmentation were due to transitional segments and the
relatively poor resolution of the system. Table 1 shows a sample of the results
for high quality speech.

TABLE 1: CONFUSION MATRIX (HIGH QUALITY SPEECH)

Weighted error (%)

        A       B       NS
A       93.5    0       1.6
B       4.3     94.9    3.4
NS      2.2     5.1     95.0

Total error: 5.6

For high quality speech, using the part of the data base with conversations
between two males, the error was between 5.5% and 6.0%. For the male/female
dialogue the error was only 4.3%.

The algorithm was also evaluated with two-speaker, telephone quality speech.
When 12th-order cepstrum features were used and non-overlapping half-second
segments were employed, classification results were very poor (11-47%
weighted error). When the feature vector was augmented with 12 delta-cepstrum
features and 75% overlapping was used, the results improved significantly. Eight
(out of the total of 12) conversations significantly improved their
classifications (approximately 6% weighted error). The confusion matrix is
presented in table 2. The other 4 (out of the 12) did not converge. These four
files were examined by a human listener. The files were found to be of very low
quality. One of the files was judged by the listener as having three, rather
than two, speakers.

We have tried to apply three-speaker (plus non-speech) networks to these files.
The non-speech segments were all well classified. Two of the four files converged
into two separate clusters (with error of about 15%) and one extra cluster that
contained segments of both speakers. The third file had one good cluster, one
cluster containing segments of both speakers and one extra cluster that contained
simultaneous speech. The last file converged into two good clusters and one extra
cluster that contained breath sounds, coughs and other interferences.

High quality, three-speaker conversation files were processed with features
and segmentation similar to the ones used for telephone quality speech. The
results were not as good as for the two-speaker case (19.5% classification
error).


TABLE  2:  TELEPHONE  CONVERSATION, 
CONFUSION  MATRIX. 


Figure 3 shows 10 seconds of segmented speech. It can be seen that, except
for very short segments of non-speech at the beginning and at 8 seconds, there
is agreement between the manual and automatic segmentation.


Fig.  3:  10  second  classification  of  telephone  conversation. 

(“0”-non-speech,  “1”-  speaker  A,  “2”-  speaker  B). 
a)  manual  segmentation,  b)  SOM  networks  segmentation 


Figure 4 shows an example of the system’s convergence as a function of
iteration number. In the case of two speakers, thirty to forty iterations are
needed for convergence for one minute of telephone conversation. For two minutes
of data, 50-65 iterations are usually needed. However, after 20 iterations, the
system usually yields results close to optimal. In practice, about 20 iterations
will be needed.


Fig. 4: The weighted error as a function of iteration number. An example for
convergence determination.


CONCLUSIONS 

A new architecture for unsupervised speaker classification, using Kohonen SOM,
was presented. With two speakers and cepstral coefficients, both high quality
and most telephone quality conversations yielded classification errors of
about 6%.

The  same  data  base  was  used  with  an  unsupervised  algorithm  based  on  HMM 
[9].  The  results  of  the  two  algorithms  are  compatible.  More  work  is  required  in 
order  to  determine,  with  sufficient  statistical  significance,  whether  one  algorithm  is 
more  accurate  than  the  other.  The  computation  time  difference  between  the  two 
must  also  be  further  examined  before  a  conclusive  comparison  can  be  made. 

The algorithm achieves better results for conversations between a male and a
female speaker. This result is not surprising, because of the differences in the
voice characteristics of the sexes.

For conversations with 3 speakers, the classification error is about 20%. The
errors appear between the speakers. The non-speech model yields about 5% error,
similar to the two-speaker case. Note that the data durations in the two- and
three-speaker experiments were approximately the same. More data may be needed
in order to improve the three-speaker results.

The  current  algorithm  assumes  that  the  number  of  speakers  is  known.  We  are 
currently  in  the  process  of  developing  a  validity  algorithm  which  will  estimate  the 
number  of  speakers  participating  in  the  conversation.  In  addition  we  are  working 
on  increasing  the  resolution  of  the  algorithm. 

A  pre-processing  algorithm  will  be  developed  to  detect  incidents  of 
simultaneous  speech,  to  allow  automatic  removal  of  such  incidents. 
ABSTRACT

This paper explores a neural network hardware structure with
distributed neurons that exhibits useful properties of self-scaling
and averaging. In conventional sigmoidal neural networks with
lumped neurons, the effects of weight errors and mismatches
become more noticeable at the output as the network becomes
larger. It is shown here, based on a stochastic model, that the
inherent scaling property of a distributed neuron structure
controls the output noise (error) to signal ratio as the number of
inputs to an Adaline increases. Moreover, the averaging effect of
distributed elements minimizes characteristic variations among
neurons. Together, these properties provide a robust hybrid
hardware with digital synaptic weights and analog neurons. A
VLSI realization and an application of this neural structure are
explained.

1.  INTRODUCTION 

One of the problems in the hardware implementation of a neural network is output
error caused by various non-ideal elements. Analog neural network circuits are
generally area-efficient but inaccurate, i.e. they are prone to problems such as
gain errors, mismatches, offsets and drifts. In digital circuits, on the other
hand, the main source of error is finite word length, which in the case of
synaptic elements is referred to as the weight quantization effect. In order to
realize dense and high-speed neural networks with a large number of neurons for
real world applications, the use of simple synapses and neurons with low
precision weights and other types of non-idealities is unavoidable. The effect
of implementation errors becomes especially noticeable at the output when the
network becomes larger [1].

In this paper we study a hybrid analog-digital neural network hardware with a
distributed neuron structure. Here, we are only concerned with errors and
quantization effects in the recall hardware or, in other words, the
implementation of an ideally-trained network.

Sensitivity  to  weight  errors  of  neural  networks  with  increasing  number  of  neurons 
is  analyzed  in  [1].  A  stochastic  model  is  developed  to  study  an  ensemble  of 
networks  with  differing  weights  and  the  focus  is  on  implementation  of  recall 
phase.  We  modify  this  model  to  study  the  properties  of  a  distributed  neuron 
hardware  structure.  It  will  be  shown  here,  based  on  the  modified  model,  that  the 
self-scaling  property  of  a  distributed  neuron  controls  its  stochastic  gain  and  hence 
reduces  the  output  noise  (error)  to  signal  ratio  when  the  size  of  the  network  grows. 
Also,  the  averaging  effect  of  distributed  elements  minimizes  mismatches  across  a 
sizable  chip.  A  VLSI  realization  and  an  application  of  this  hardware  structure  are 
also  presented  along  with  some  simulation  and  experimental  test  results. 


2.  IMPLEMENTATION  OF  DISTRIBUTED  NEURON 

Figure 1 shows an Adaline with distributed neurons. In general, the two main
tasks of a neuron are the summation of synaptic inputs and a nonlinear
saturating function. If an Adaline is built with transconductance synapses and
nonlinear resistive neurons, then summation is performed simply by hardwiring
the outputs of the synapses together. As shown in Fig. 1, a neuron of this type
can be distributed into parallel elements having the same equivalent
characteristic. In this case, the equivalent resistive neuron receives the
summation current and delivers, on the same supernode, an output voltage
nonlinearly proportional to the total synaptic input.


Fig.  1  Hybrid  structure  with  distributed  neuron 


Each element of a distributed neuron can be integrated with one synapse to form a
unified synapse-neuron (USN) block. A reconfigurable network based on
distributed neurons is discussed in [2]. A hybrid analog-digital USN is presented
by the authors in [3]. Here, experimental test results are presented from a
recent CMOS fabrication. Fig. 2-a shows the transistor-level diagram of a
distributed neuron. Each element of the neuron receives an average of the input
synaptic currents ($I_{ave}$) and performs a nonlinear saturating I-to-V function
by combining the quadratic characteristics of four MOS transistors.


2.1.  Self-scaling  Property 

Fig. 2-b shows the measured characteristics of 2-input and 5-input distributed
neurons fabricated in 1.2 µm CMOS. The distributed neuron structure exhibits an
interesting self-scaling property. As the number of synaptic inputs (i.e. the
number of neurons in the previous layer) increases, the overall nonlinear
characteristic stretches by itself. This property restores information received
from extra inputs that would otherwise have been lost in the large saturation
areas of a fixed lumped neuron with an increasing number of inputs.


Fig.  2  b  Experimental  test  results  and  self-scaling  property 


In fact, different applications require different numbers of neurons and neuron
inputs. When the number of inputs to a lumped neuron increases, over-saturation
occurs. One method to circumvent this situation is to reduce synaptic activity by
scaling down each weight by an arbitrary factor. This method is handy in software
implementation. Equivalently, we should be able to use the same net input
combined with a scaled activation function. The distributed neuron structure
presents such a scaling scheme. In fact, if the number of inputs to a distributed
neuron is increased by a factor S, the neuron will consist of S similar nonlinear
resistive blocks in parallel (each possibly consisting of N original sub-blocks;
here we assume N=1). As the current divides equally among the S similar blocks
(each block receiving an average current $I_{ave}$), the output voltage can be
obtained in two alternative ways. If we consider the overall nonlinear function
F(.), we have $V_{out} = F(I_{sum})$. On the other hand, regarding each
individual block with nonlinear resistive function f(.), we can write:

$$V_{out} = f(I_{ave}) = f\left(\frac{I_{sum}}{S}\right) = f\left(\frac{\sum_k w_k V_k}{S}\right) = f\left(\sum_k \frac{w_k}{S} V_k\right)$$

Since the two voltages must be the same, we conclude:

$$F(I_{sum}) = f\left(\sum_k \frac{w_k}{S} V_k\right) \tag{1}$$

Therefore,  a  distributed  neuron  exhibits  a  self-scaling  property  which  is  equivalent 
to  scaling  down  all  the  weights  proportional  to  increase  in  the  number  of  inputs. 
This  property  will  be  used  in  Section  3  to  define  a  stochastic  model  for  distributed 
neuron  and  to  quantify  the  improvement  obtained  from  this  structure. 
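
The equivalence stated in Eq. (1) can be checked numerically with a
small Python sketch. The block characteristic is taken to be `tanh`
purely for illustration; the measured characteristic of the fabricated
neuron is not given in closed form here, and the function names and
sample values are assumptions.

```python
import math


def block(current):
    """Hypothetical saturating I-to-V characteristic f(.) of one block."""
    return math.tanh(current)


def distributed_output(weights, inputs, s):
    """S parallel blocks share the summed current, each seeing I_sum / S."""
    i_sum = sum(w * v for w, v in zip(weights, inputs))
    return block(i_sum / s)


def scaled_weight_output(weights, inputs, s):
    """A single block f(.) driven by weights that are scaled down by S."""
    return block(sum((w / s) * v for w, v in zip(weights, inputs)))


weights = [0.5, -1.0, 2.0, 1.5]
inputs = [1.0, 0.5, 1.0, 2.0]
v_distributed = distributed_output(weights, inputs, s=4)
v_scaled = scaled_weight_output(weights, inputs, s=4)
v_lumped = block(sum(w * v for w, v in zip(weights, inputs)))
```

With these sample values the unscaled lumped neuron is driven deep into
saturation, while the distributed version stays in the sloped part of
the characteristic.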

2.2.  Averaging  Effect 

Analog lumped neurons implemented at different locations across a sizable chip
are subject to noticeable variations in their expected characteristics. To
demonstrate a worst case scenario, 2-input neurons with the same circuits as
explained before are laid out as lumped cells at various locations, including
corner positions, on a test chip. Measurements are performed on different cells
and repeated over five fabricated chips. The worst case on-chip variation of the
characteristic is found between two corner cells, as 66 mV in a 5 V range (i.e.
an analog accuracy of 1.3%, approx. equivalent to 1 of 6 bits resolution). In
Fig. 2-c a typical measured characteristic is shown on the left and a close-up of
the worst case curves around 5V is shown on the right. The existing variations
are related to fabrication process parameters, mainly the gradient of the
threshold voltage, VT, across the silicon die.


The advantage of a truly distributed neuron can be observed when the building
elements are distributed in one or two dimensions across the chip. In this case,
an average of the various characteristics is obtained, which corresponds to
average process parameters. The characteristic variation between two averaged
neurons built in this manner is reduced to the variability of two adjacent
cells. In the case of our test chip, this mismatch was 26 mV in a 5V range, or
0.5%, as opposed to 1.3% for the case of lumped corner cells.


3.  STOCHASTIC  MODEL  FOR  DISTRIBUTED  NEURON 

For a conventional Madaline with lumped neurons, a stochastic model defines the
ideal output of node n in layer l and the corresponding output error as
follows [1]:

$$Y_{l,n} = f\left(X_l^T \cdot W_{l,n}\right) \quad (X_l^T \text{ stands for the transposed matrix}) \tag{2}$$

$$\Delta Y_{l,n} = f\left((X_l + \Delta X_l)^T \cdot (W_{l,n} + \Delta W_{l,n})\right) - f\left(X_l^T \cdot W_{l,n}\right) \tag{3}$$

$W_{l,n}$, $X_l$, $\Delta W_{l,n}$ and $\Delta X_l$ are independent identically
distributed (iid) random vectors representing weights, inputs, weight errors and
input errors respectively.

Based on this model, the output Noise-to-Signal Ratio of layer l, defined as the
ratio of the variance of the output error of layer l to the variance of the
ideal output of layer l, is formulated as follows:

$$NSR_l = \frac{\sigma^2_{\Delta y_l}}{\sigma^2_{y_l}} = g\left(\sqrt{N}\,\sigma_x \sigma_w\right) \times \left(\frac{\sigma^2_{\Delta x}}{\sigma^2_x} + \frac{\sigma^2_{\Delta w}}{\sigma^2_w}\right) \tag{4}$$

The output NSR of a sigmoidal Adaline is expressed as a linear combination of
the input NSR, $\sigma^2_{\Delta x}/\sigma^2_x$, and the weight NSR,
$\sigma^2_{\Delta w}/\sigma^2_w$, and is amplified by a stochastic gain function
g > 1. The gain g is an increasing function of its argument,
$\sqrt{N}\,\sigma_x \sigma_w$, where N is the number of inputs to the Adaline,
and $\sigma_x$ and $\sigma_w$ are the standard deviations of input and weight,
respectively.


Thus, in a conventional neural network with lumped sigmoidal neurons, an increase
in the number of inputs causes an increase in the stochastic gain, g, and hence
an unwanted increase in output NSR. If the number of inputs to the Adaline
increases by a factor S and the input and weight variances do not change, in the
absence of any scaling scheme the output NSR will increase as follows:

$$NSR = g\left(\sqrt{SN}\,\sigma_x \sigma_w\right) \times \left(\frac{\sigma^2_{\Delta x}}{\sigma^2_x} + \frac{\sigma^2_{\Delta w}}{\sigma^2_w}\right) \tag{5}$$

In  a  distributed  neuron  structure,  on  the  other  hand,  the  characteristic  of  a  neuron  is 
changed  adaptively  based  on  the  number  of  inputs.  This  self-scaling  effect  is  a 
natural  way  of  controlling  g  and  hence  decreasing  NSR. 
It has been shown earlier in Eq. (1) that the self-scaling property of a
distributed neuron is, in effect, equivalent to scaling down all the weights to
the same saturating function while the number of inputs increases. Now, let us
investigate the effect of this on the output Noise (error) to Signal Ratio of an
Adaline.

If $w_S = w/S$ is defined as the scaled weight, then we have
$\sigma^2_{w_S} = \sigma^2_w / S^2$ and
$\sigma^2_{\Delta w_S} = \sigma^2_{\Delta w} / S^2$. Thus,

$$NSR = g\left(\sqrt{SN}\,\sigma_x \sigma_{w_S}\right) \times \left(\frac{\sigma^2_{\Delta x}}{\sigma^2_x} + \frac{\sigma^2_{\Delta w_S}}{\sigma^2_{w_S}}\right) = g\left(\sqrt{\frac{N}{S}}\,\sigma_x \sigma_w\right) \times \left(\frac{\sigma^2_{\Delta x}}{\sigma^2_x} + \frac{\sigma^2_{\Delta w}}{\sigma^2_w}\right) \tag{6}$$

The terms in the linear combination remain unchanged, while the gain factor is
reduced due to the scaling of its argument by 1/S. The corresponding stochastic
model is shown in Fig. 3. This property reduces NSR and improves the performance
of recall hardware, especially in large networks. The improvement is shown here
by an example.

Fig. 3 Stochastic model of an Adaline with distributed neuron

Example: Consider an Adaline with $N = 25$ inputs, for which inputs and
weights are uniformly distributed over the range $[a, b] = [-2, 2]$; therefore
$\sigma_x^2 = \sigma_w^2 = (b - a)^2 / 12 = 4/3$. For an 8-bit quantization scheme, weights are
quantized to levels equally spaced by $\delta = 1/64$; thus, the weight error variance will
be $\sigma_{\Delta w}^2 = \delta^2 / 12 \approx 2 \times 10^{-5}$. Further, we assume $\sigma_{\Delta x}^2 = \sigma_{\Delta w}^2$. For
$\sqrt{N}\,\sigma_x \sigma_w > 2$, the gain function defined in [1] may be approximated as
$g(\sqrt{N}\,\sigma_x \sigma_w) = 0.5 + 0.534\,\sqrt{N}\,\sigma_x \sigma_w$. The output noise-to-signal ratio of this Adaline
is:

$$\mathrm{NSR} = g(\sqrt{N}\,\sigma_x \sigma_w) \left(\frac{\sigma_{\Delta x}^2}{\sigma_x^2} + \frac{\sigma_{\Delta w}^2}{\sigma_w^2}\right) = g(6.67)\,(3 \times 10^{-5}) \approx 1.2 \times 10^{-4} \approx -39\ \mathrm{dB}.$$

Now if the number of inputs is increased to 100 (an increase by a factor $S = 4$), for a
conventional lumped neuron characteristic, the NSR is found from (5) as
$\mathrm{NSR} = g(13.3)\,(3 \times 10^{-5}) \approx 2.3 \times 10^{-4} \approx -36.4$ dB. If, instead, we use a distributed
neuron structure, then from (6) we obtain $\mathrm{NSR} = g(3.33)\,(3 \times 10^{-5}) \approx 6.8 \times 10^{-5} \approx -41.7$ dB.
In this example, the NSR is reduced by roughly a factor of 3,
i.e. more than 5 dB improvement. The effect would be even more noticeable for
larger scaling factors.
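
The arithmetic of this example can be re-checked with a short script. The sketch below is only an illustration under stated assumptions: it uses the linear gain approximation quoted in the example, takes the input error variance equal to the weight error variance, and models the distributed neuron simply by dividing the gain argument by the scaling factor. The function names (`gain`, `nsr`, `to_db`) are hypothetical and not taken from the paper.

```python
import math

def gain(y):
    """Linear approximation of the gain function quoted in the example (stated for y > 2)."""
    return 0.5 + 0.534 * y

def nsr(n_inputs, var_x, var_w, var_dx, var_dw, scale=1.0):
    """Output noise-to-signal ratio; `scale` is the factor S (1.0 means a lumped neuron)."""
    argument = math.sqrt(n_inputs) * math.sqrt(var_x * var_w) / scale
    return gain(argument) * (var_dx / var_x + var_dw / var_w)

def to_db(ratio):
    """Convert a power ratio to decibels."""
    return 10.0 * math.log10(ratio)

var_x = var_w = (2 - (-2)) ** 2 / 12   # uniform on [-2, 2], i.e. 4/3
delta = 4 / 2 ** 8                     # 8-bit step over a range of width 4, i.e. 1/64
var_dw = delta ** 2 / 12               # about 2e-5
var_dx = var_dw                        # assumption: equal input and weight error variances

nsr_25 = nsr(25, var_x, var_w, var_dx, var_dw)
nsr_100_lumped = nsr(100, var_x, var_w, var_dx, var_dw)
nsr_100_distributed = nsr(100, var_x, var_w, var_dx, var_dw, scale=4)

for label, value in [("N=25", nsr_25),
                     ("N=100 lumped", nsr_100_lumped),
                     ("N=100 distributed", nsr_100_distributed)]:
    print(f"{label}: NSR = {value:.2e} ({to_db(value):.1f} dB)")
```

Because the script keeps the unrounded variances, its decibel values differ from the rounded figures in the text by about 0.1 dB.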
4.  VLSI  APPLICATION

Figure 4 shows the schematic diagram of a 4-3-2 hybrid VLSI neural network with
a distributed neuron structure, designed and implemented in 1.2 µm CMOS using
Cadence tools. The network is built from 23 similar blocks. Eighteen of these
blocks are unified synapse-neurons (USNs), each consisting of a multiplying DAC
synapse with a 5-bit sign-magnitude weight and a portion of a distributed neuron.
The five remaining blocks are used for neuron bias (threshold) adjustment. These
units are in fact the same USNs on silicon with their nonlinear load deactivated.


Fig.  4  Schematic  diagram  of  VLSI  implementation 


This network is trained off-line for a 4-input template matching problem. An
interactive back-propagation simulator, based on XView programming on a Sun
SPARC workstation, is used; it allows the user to define the network architecture
and I/O patterns. The resulting weights are rounded off to the resolution of the
hardware (5 bits) and a simulated recall follows. When this final phase is passed,
the weights are programmed on chip. Weights (w's) and bias values (b's) for the
above circuit are:


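$$W^{(1)} = \begin{bmatrix} -5 & 0 & -5 \\ 6 & 9 & 8 \\ 12 & -10 & 1 \\ -5 & 0 & -5 \end{bmatrix}, \qquad W^{(2)} = \begin{bmatrix} -14 & 6 \\ 5 & -15 \\ -7 & -8 \end{bmatrix}, \qquad b_n^{(l)} = 4 \quad (\forall\, l, n)$$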
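
The rounding of trained weights to the 5-bit sign-magnitude resolution mentioned above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `quantize_sign_magnitude` and the `full_scale` normalization are assumptions introduced here, and the saturation of out-of-range weights is one possible choice among several.

```python
def quantize_sign_magnitude(weights, bits=5, full_scale=1.0):
    """Map real weights to integer codes with 1 sign bit and (bits - 1) magnitude bits.

    `full_scale` (hypothetical parameter) is the weight magnitude mapped to the
    largest code; larger magnitudes saturate.
    """
    max_code = 2 ** (bits - 1) - 1              # 15 for a 5-bit sign-magnitude weight
    codes = []
    for w in weights:
        magnitude = int(round(abs(w) / full_scale * max_code))
        magnitude = min(magnitude, max_code)    # saturate instead of overflowing
        codes.append(-magnitude if w < 0 else magnitude)
    return codes

codes = quantize_sign_magnitude([0.93, -1.2, 0.02, -0.33])
print(codes)
```

With 4 magnitude bits every code lies in the range -15 to 15, which is consistent with the weight values listed above.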
Figure 5a shows the four templates presented to the network during training.
Outputs 1 and 2 respond to templates 1 and 2, respectively. Both outputs remain
zero for templates 0 and 3. A complete transistor-level simulation was performed on
the 693 active and about 900 parasitic elements of the recall circuit shown in Fig. 4.
Figure 5b shows typical post-layout results at a 1 MHz input vector rate. To
calculate each output vector, the hybrid neural network performs 18 multiplications, 5
additions and 5 distributed nonlinear operations, an equivalent calculation rate of
$28 \times 10^6$ operations per second. The average current and power consumption at
Vdd = 5 V are $I_{ave} \approx 730\ \mu$A and $P_{ave} \approx 3.65$ mW. The circuit is also functional at
higher speeds or on lower supply voltages. For example, at Vdd = 3.3 V the figures
are $I_{ave} \approx 155\ \mu$A and $P_{ave} \approx 0.5$ mW, i.e. an 86% power saving compared to the
consumption at Vdd = 5 V.
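
The throughput and power figures quoted above follow from simple arithmetic, which the following sketch reproduces. The variable and function names are illustrative only; the numbers are those given in the text.

```python
# Operations needed per output vector of the 4-3-2 network, as listed in the text.
multiplications = 18
additions = 5
nonlinear_operations = 5
vector_rate_hz = 1e6                      # 1 MHz input vector rate

ops_per_second = (multiplications + additions + nonlinear_operations) * vector_rate_hz

def power_mw(vdd_volts, current_ua):
    """Supply power in milliwatts from supply voltage (V) and average current (uA)."""
    return vdd_volts * current_ua * 1e-3  # V * uA = uW, then uW -> mW

power_5v = power_mw(5.0, 730)             # about 3.65 mW
power_3v3 = power_mw(3.3, 155)            # about 0.51 mW
saving = 1.0 - power_3v3 / power_5v       # fraction of power saved at 3.3 V

print(f"{ops_per_second:.0f} operations/s")
print(f"{power_5v:.2f} mW at 5 V, {power_3v3:.2f} mW at 3.3 V, saving {saving:.0%}")
```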


Fig. 5  Template matching problem and results: (a) templates 0 to 3, (b) post-layout
waveforms (Cadence waveform display, time axis in seconds)

A chip has been designed in 1.2 µm CMOS (see Fig. 6) that contains two different
versions of this network: 1) a weight-programmable network with electronic inputs,
and 2) a network with photosensitive inputs and pre-programmed weights for an
optical template matching application.

With programmable synaptic weights, the network can be programmed to recognize
the different templates required by different applications. The particular
application that we are interested in is feature extraction in a handwritten numeral
recognition system. This system consists of three stages: preprocessing, feature
extraction and classification. Directional border codes are the features to be
extracted by this network [4]. The basic process performed on a typical
handwritten number and the 2x2 directional templates for feature extraction are
illustrated in Fig. 7. More details about this system can be found in [5].

The same VLSI building blocks are conveniently used in the design of larger
neural networks (e.g. [6]). In a network with 16 inputs, the required weight
resolution for recall is still five bits (1 sign and 4 magnitude). If lumped neuron
blocks were used in a larger network, however, the quantization noise and
mismatches would increase the output noise-to-signal ratio, and in order to control
the NSR and avoid misclassification one would have to: a) increase the weight
resolution and analog circuit accuracies, or b) redesign the neuron blocks to
effectively reduce their gain factor against implementation errors.


5.  CONCLUSION 

A robust neural network hardware structure with analog distributed neurons, digital
weights and multiplying DAC synapses has been presented. Two properties of this
structure, namely self-scaling and averaging of neurons, both of which reduce the
effect of some implementation errors, have been emphasized. Experimental test
results have been presented for the building blocks, and a VLSI application has
been described.

Fig. 7  Border feature extraction and directional templates

ABSTRACT:

This paper applies multilayer neural networks to the
problem of forecasting the flow of the River Nile in
Egypt. Estimating the flow of the River Nile can have
significant economic impact, since it can help in managing
scarce irrigation water. The second goal of the paper
is to utilize the time series as a benchmark to compare
different neural network forecasting methods.
We compare four different methods for input
and output preprocessing, including a novel method proposed
here based on the Discrete Fourier Series. We also
consider the problem of forecasting several steps ahead,
and compare three methods for the multistep
ahead prediction problem.


INTRODUCTION: 

Neural networks have been used in many forecasting applications,
for example stock market, exchange rate [1], and electric
load forecasting [2]. We present here yet another forecasting
application, namely river flow forecasting. Forecasting river flows is
an important application that has attracted the interest of scientists
for more than 50 years. It can help in predicting agricultural
water supply, predicting potential flood damage, estimating loads
on bridges, etc.

In this paper the first goal is to apply neural networks to
the problem of predicting the flow of the River Nile in Egypt.
In addition, we exploit the river forecast problem as a benchmark
to compare different neural network forecasting
approaches. The approaches, which all use multilayer networks
with backpropagation training, differ mainly in how they
preprocess the time series and in how they solve
the multistep ahead forecast problem.


ON  THE  FLOW  FORECAST  PROBLEM: 

The river flow forecasting problem has traditionally been tackled
using linear techniques, such as AR, ARMAX, and the Kalman filter,
and also using nonlinear regression (see [3]-[6]). We have found
only one method in the literature that uses neural networks, namely
for the Huron River in Michigan [6]. Most of the forecasting
methods consider a one-day ahead forecast. For the River Nile a
longer term forecast is of more interest, though it is more difficult
than the one-day ahead problem.

The problem of flow forecasting for the River Nile is particularly
important for Egypt, which depends almost exclusively on the
Nile for agricultural irrigation. The flow of the River Nile exhibits
seasonal behavior: it is low during the winter months
and peaks during August and September. The
High Dam of Aswan (located in the south of Egypt) retains incoming
water and releases it in a more uniform way, so as to optimally
meet agricultural and electricity generating needs (acting like a
capacitor or a reservoir). Forecasting the flow of the River
Nile can help in determining the optimum amount of water to
release, and thus can help to manage the water more efficiently.

NEURAL NETWORK APPLICATION:


We used readings of the average daily flow volume for each ten-day
period at the Dongola station, located in Northern Sudan.
The readings spanned the period from 1974 to 1992. We used a
network with one hidden layer of 3 hidden nodes, and trained it
using the standard backpropagation method for 4000 iterations.
We used 150 points for each of the training stage and the testing
stage. As the error measure we used the RMS normalized error, defined
as the square root of the sum of squared errors, divided by the
square root of the sum of squared desired outputs. Henceforth,
“error” will always mean the RMS normalized error. As for the
crucial design step of determining what inputs to use for the
neural network, we experimented with the following inputs:

1)  the  flows  at  the  previous  few  time  periods, 

2)  the  flow  at  the  same  time  period  one  year  ago  and  two  years 
ago, 

3)  the  average  of  the  flow  of  the  last  12  months, 

4)  the  period  number  (scaled  by  the  number  of  periods),  e.g. 
for  the  month  of  May  the  input  would  be  5/12. 
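
The error measure and the four candidate input types listed above can be sketched in a few lines. This is only an illustration under assumptions that are not from the paper: the series is assumed to start at the first period of a year, the number of periods per year is left as a parameter, and the names `rms_normalized_error` and `build_inputs` are hypothetical.

```python
import math

def rms_normalized_error(desired, predicted):
    """sqrt(sum of squared errors) divided by sqrt(sum of squared desired outputs)."""
    squared_error = sum((d - p) ** 2 for d, p in zip(desired, predicted))
    squared_desired = sum(d ** 2 for d in desired)
    return math.sqrt(squared_error) / math.sqrt(squared_desired)

def build_inputs(flow, t, periods_per_year, n_lags=3):
    """Candidate inputs for forecasting flow[t + 1].

    Requires t >= 2 * periods_per_year - 1 so that the value two years back exists.
    """
    target = t + 1
    lags = [flow[t - i] for i in range(n_lags)]                  # 1) previous few periods
    one_year_ago = flow[target - periods_per_year]               # 2) same period, one year ago
    two_years_ago = flow[target - 2 * periods_per_year]          #    and two years ago
    window = flow[t - periods_per_year + 1: t + 1]
    yearly_average = sum(window) / periods_per_year              # 3) average over the last year
    period_number = (target % periods_per_year + 1) / periods_per_year  # 4) scaled period number
    return lags + [one_year_ago, two_years_ago, yearly_average, period_number]

flow = [float(k) for k in range(100)]      # placeholder data, not river measurements
inputs = build_inputs(flow, t=30, periods_per_year=12)
print(inputs)
print(rms_normalized_error([3.0, 4.0], [3.0, 3.0]))
```

With twelve periods per year, the fifth period of a year yields the scaled period number 5/12, which matches the example given for the month of May.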

We performed simulations and made the following observations:

1)  There is a strong correlation between the training error and
the testing error. This means that there is good generalization.
Choosing a network/input set that gives a low training
error will almost surely result in a low testing error.

2)  In several exploratory runs we found that no validation
set was needed to determine the optimal stopping point in
training. The error for the test set goes down uniformly with
iteration and does not bottom out.

3)  For the majority of input combinations the results were
somewhat similar. Several cases gave
higher errors, but these were mostly cases with an insufficient
number of inputs, and one can eliminate them easily
by observing the training error.

4)  The forecasts for all periods of the year were quite accurate.
Only for the peak flow periods was there some small error
(see Figure 1 for the test forecast results of a ten-day ahead
case). This suggests possibly training a separate network for
the high flow periods, and using some kind of gating-type
network (see [7], [8], [9]).


Figure 1: The results of the ten-day ahead
forecast for the test period


ALGORITHM  COMPARISON: 

In addition to this basic neural network implementation, we used
the data as a benchmark to compare different neural
network forecasting methods. The methods we used are the following:

Method 1) The neural network is trained to forecast the actual
flow of the next time period.

Method 2) The neural network is trained to forecast the difference
between the next period's flow and the current period's
flow (the desired output for the neural network at time $t$ is
$x(t+1) - x(t)$, properly scaled).

Method 3) We subtract the seasonal average of the flow to create
a seasonally adjusted time series. Then we apply the neural network
to forecast this seasonally adjusted series. This approach,
hopefully, makes the problem simpler for the neural network by
letting it concentrate on forecasting the deviations
from the seasonal average, rather than estimating both
the seasonal average and the deviation.

Method 4) A novel algorithm we propose here, which might be
particularly suitable for seasonal (cyclical) time series. Let $x(t)$
be the time series values, and let the period be fixed (say $T$),
although the time series varies from one cycle to the next and can
have a longer term variation or trend. Because of the seasonality
of the time series, the Fourier series is a natural representation
of such a series, and carries much useful information relevant
to the task of forecasting it. Therefore, predicting the Fourier
coefficients, with the coefficients themselves given as inputs to
the neural network, seems to be a potentially effective method. For every
time $t$ we calculate the Discrete Fourier Series (DFS) of the points
$x(t-T+1), \ldots, x(t)$ to obtain the DFS coefficients $X_0(t), \ldots, X_{T-1}(t)$.
These represent the result of a moving-window DFS calculation.
We train a separate neural network for every DFS
coefficient. The network for coefficient $n$ takes the most recent values of
the DFS coefficient time series, up to $X_n(t)$, as inputs and
is trained to estimate the future coefficient $X_n(t+1)$. Once all
$T$ forecasts from the $T$ neural networks are available, they are
inverted to obtain an estimate of the whole signal in the $T$-long
period from $t-T+2$ till $t+1$. The estimate of the signal at the
end of the period (at time $t+1$) is taken as the signal forecast.
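
The moving-window DFS and its inversion, as described for Method 4, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the coefficient predictors (the neural networks) are left out, the function names are hypothetical, and the check at the end simply uses the true coefficients of the last window in place of a forecast, to confirm that inverting them returns the last sample of that window.

```python
import cmath
import math

def dfs(window):
    """Discrete Fourier Series coefficients X_0 .. X_{T-1} of a length-T window."""
    T = len(window)
    return [sum(x * cmath.exp(-2j * cmath.pi * n * k / T) for k, x in enumerate(window))
            for n in range(T)]

def inverse_dfs(coefficients):
    """Reconstruct the real-valued length-T window from its DFS coefficients."""
    T = len(coefficients)
    return [sum(X * cmath.exp(2j * cmath.pi * n * k / T)
                for n, X in enumerate(coefficients)).real / T
            for k in range(T)]

def moving_window_dfs(series, period):
    """Coefficients X_n(t) for every t at which a full window x(t-T+1..t) exists."""
    return [dfs(series[t - period + 1: t + 1]) for t in range(period - 1, len(series))]

def forecast_from_coefficients(predicted_coefficients):
    """Invert forecast coefficients X_n(t+1); the last sample is the estimate of x(t+1)."""
    return inverse_dfs(predicted_coefficients)[-1]

# Synthetic seasonal series with a slow trend (placeholder data, not Nile flows).
period = 6
series = [math.sin(2 * math.pi * k / period) + 0.05 * k for k in range(24)]
coefficient_series = moving_window_dfs(series, period)

# Stand-in for a perfect forecast: the true coefficients of the final window.
estimate = forecast_from_coefficients(coefficient_series[-1])
print(estimate, series[-1])
```

Since the coefficients are complex, a practical predictor would have to handle their real and imaginary parts in some way; how this is done is not specified in the text and is left open here.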

In addition to these methods, we considered the problem of forecasting
several steps ahead (the multistep ahead problem). Thus,
we have a time series $x(1), \ldots, x(t)$, and would like to forecast
$x(t+k)$, where $k > 1$. Of course, the larger $k$ is, the more difficult
the problem. We considered several methods for
the multistep ahead forecast problem, and we
compare them here. The methods are:

Method 5) Direct Method: We train the neural network to directly
forecast the $k$-th period ahead. Thus, the desired output
for the network will be $x(t+k)$.

Method 6) Recursive Method: We consider a network that forecasts
a single step ahead, and apply this network recursively to
forecast $k$ steps ahead. Thus, at any intermediate step the network
will use some of the forecasts it obtained at previous steps
as inputs. There are two basic methods to train such a network.
We implemented both methods in the comparison:

a)  Train the network to perform simply a single step ahead
forecast.

b)  Use a backpropagation through time scheme [10].
That is, we consider the $k$ steps ahead forecast as the result
of a cascade of identical networks, and train this composite
structure of networks.
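
The difference between the direct method and the recursive method (variant a) can be sketched with a toy predictor. To keep the example self-contained, a least-squares linear fit with a single input is swapped in for the neural network; this substitution, the synthetic series and the function names are assumptions for illustration only. Variant b (backpropagation through time) is not shown.

```python
def fit_linear(series, k=1):
    """Least-squares fit of x(t + k) ~ a * x(t) + b; a stand-in for a trained network."""
    xs, ys = series[:-k], series[k:]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b

def recursive_forecast(one_step_model, x_t, k):
    """Apply a single step ahead model k times, feeding each forecast back as input."""
    value = x_t
    for _ in range(k):
        value = one_step_model(value)
    return value

# Synthetic decaying series (placeholder data).
series = [10.0 * 0.9 ** t for t in range(30)]
k = 3

direct_model = fit_linear(series, k=k)       # Method 5: trained on x(t + k) directly
one_step_model = fit_linear(series, k=1)     # Method 6a: trained on x(t + 1) only

direct = direct_model(series[5])
recursive = recursive_forecast(one_step_model, series[5], k)
print(direct, recursive, series[8])
```

On this noise-free toy series both approaches coincide; on real data the recursive forecast accumulates the error of each intermediate step, which is the effect the comparison is meant to measure.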


IMPLEMENTATION  RESULTS: 

We performed simulations for each of
the single step ahead group and the multistep ahead group. For
each comparison we performed five different runs using five different
combinations of some of the inputs described in the NEURAL
NETWORK APPLICATION section.
Table 1 shows the results for the test period for the single step
ahead case. One can see that the most basic method, forecasting
the actual flow (Method 1), results in the best forecast accuracy.
The results were generally consistent across all five runs, meaning
that the variation in error among the five runs was small. For
the multistep group, we performed both a two step ahead
comparison and a three step ahead comparison (see Table 2 and
Table 3 respectively). One can see that the direct method
(Method 5) performs best overall: it is on par with the
backpropagation through time variant in the two step ahead case
and clearly better in the three step ahead case. The
recursive method trained using a backpropagation through time
approach was much better than that trained to perform a single
step ahead forecast.

Although the comparative performance of the different approaches
is usually problem dependent, this comparison should
give some insight, and is therefore an addition to other comparison
studies such as [11].


Method       Avg(Test Error)    StDev(Test Error)
Method 1     0.213              0.006
Method 2     0.240              0.014
Method 3     0.290              0.011
Method 4     0.287              0.008

Table 1: Comparison of single step ahead methods: the
average and the standard deviation of the testing error


Method       Avg(Test Error)    StDev(Test Error)
Method 5     0.315              0.007
Method 6a    0.685              0.184
Method 6b    0.310              0.008

Table 2: Comparison of multistep ahead methods: the
average and the standard deviation of the testing error
for the two step ahead case


Method       Avg(Test Error)    StDev(Test Error)
Method 5     0.374              0.021
Method 6a    0.623              0.164
Method 6b    0.489              0.080

Table 3: Comparison of multistep ahead methods: the
average and the standard deviation of the testing error
for the three step ahead case




ACKNOWLEDGEMENTS: 

The  authors  would  like  to  acknowledge  the  help  of  Dr.  Ashraf 
Ghanem  and  of  M.  Bayoumi  in  supplying  the  data. 

REFERENCES: 

[1]  Proc. Neural Networks in the Capital Markets Conf.,
London Business School, London, U.K., October 1995.

[2]  J. Connor, D. Martin, and L. Atlas, “Recurrent neural networks
and robust time series prediction”, IEEE Trans.
Neural Networks, Vol. 5, No. 2, pp. 240-254, March 1994.

[3]  H. Awwad, J. Valdes, and P. Restrepo, “Streamflow forecasting
for Han River basin, Korea”, J. Water Resources
Planning and Management, Vol. 120, No. 5, pp. 651-673, 1994.

[4]  M. El-Fandy, Z. Ashour, and S. Taiel, “Time series models
adoptable for forecasting Nile floods and Ethiopian rainfalls”,
Bulletin American Meteorological Society, Vol. 75, No. 1,
pp. 1-12, January 1994.

[5]  D. Burn and E. McBean, “River flow forecasting model for
the Sturgeon River”, Hydraulic Engineering, Vol. 111,
No. 2, pp. 316-333, February 1985.

[6]  N. Karunanithi, W. Grenney, D. Whitley and K. Bovee,
“Neural networks for river flow prediction”, Computing
in Civil Engineering, Vol. 8, No. 2, pp. 203-219, April 1994.

[7]  R. Jacobs and M. Jordan, “A competitive modular connectionist
architecture”, in Advances in Neural Information
Processing Systems 3, R. Lippmann, J. Moody,
and D. Touretzky, Eds., pp. 767-773, Morgan Kaufmann,
San Mateo, CA, 1991.

[8]  S. Haykin, Neural Networks: A Comprehensive Foundation,
IEEE Press, 1994.

[9]  A. Weigend, M. Mangeas, and A. Srivastava, “Nonlinear
gated experts for time series: discovering regimes and avoiding
overfitting”, Int. Journal of Neural Systems, Vol. 6,
pp. 373-399, 1995.

[10]  P. Werbos, “Back-propagation through time: what it does
and how to do it”, Proceedings of the IEEE, Vol. 78,
No. 10, October 1990.

[11]  A. Weigend and N. Gershenfeld (Eds.), Time Series Prediction:
Forecasting the Future and Understanding
the Past, Santa Fe Institute, Addison-Wesley, 1994.




ROBUSTNESS OF A CHAOTIC MODAL NEURAL
NETWORK APPLIED TO AUDIO-VISUAL SPEECH
RECOGNITION


Harouna  Kabre 


CLIPS-IMAG  Laboratory 
Joseph  Fourier  University 
Grenoble,  BP  53,  Bat  B,  38041,  France, 
e-mail:  Harouna.Kabre@imag.fr 


Abstract 

We  stabilized  a  chaotic  Modal  Neural  Network  (MNN)  for  the  purpose  of 
robust  speech  recognition.  A  Modal  Neural  Network  is  an  Artificial  Neural 
Network  system  which  includes  two  levels  of  information  processing.  The  first 
level  is  trained  to  store  and  retrieve  some  acoustic  and  visual  patterns.  The 
different  states  of  this  network,  which  represent  the  sound  classes  in  a  task  of 
speech  recognition,  are  called  modes  and  are  supposed  to  chaotically  evolve 
when  speech  recognition  is  performed  in  adverse  environments. 

The  control  of  the  chaotic  behavior  of  the  different  modes  constitutes  the  second 
level.  An  external  signal,  taken  from  a  visual  input  such  as  the  lip-opening 
parameters  of  the  speaker  is  applied  to  stabilize  an  acoustic  modal  network  of 
which  the  modes  are  moved  from  an  initial  position  to  a  target  position.  The 
addressed task is the audio-visual recognition of the 10 French vowels perturbed
by noise. Perceptual Linear Predictive analysis of the speech signal of the
10 vowels outputs vectors of 5 spectral parameters, which are in turn fed into a
Modal Neural Network implemented as a feed-forward network. When the noise level
increases, the classes stored by the acoustic MNN exhibit a chaotic behavior which
is stabilized by the signal given by the visual path. We show that in an
uncooperative environment, a chaotic modal neural network stabilizes well.

Key Words: Robustness, Chaos, Audio-Visual Speech Recognition, Adaptation.

1.  INTRODUCTION


Robustness is the ability of a given system to face unknown situations. In the speech
domain the problem for a speech perceiver (e.g. an Automatic Speech Recognition
System) is to maintain good recognition performance even if the spectral and
temporal information conveyed in the speech signal is perturbed by
undesirable noise. This is a crucial problem because, when the training and testing
conditions differ, the performance of speech recognition systems can be
drastically reduced, thus limiting the spread of speech technology to real world
applications.

Even if the current approaches (which are based on improving the quality of
microphones or on better speech signal processing) have given encouraging results
[1], they are all based on the idea that the performance degradation of speech
perceivers can be eliminated by decreasing the mismatch between their training and
testing conditions. However, our experience in this domain [2] showed that only small
changes in acoustic environments can be handled by this kind of approach, while large
changes require a better understanding of the nature of the perturbation and
lessons drawn from the impressive robustness of living systems.

Stein and Meredith (1993) demonstrated that information processing in a cat’s
brain is initially segregated at the neural level. Neurons dedicated to one sense do
not interact with those of another sense until the stimulus is transmitted to the brain.
There the signals converge on the same target in the superior colliculus. The superior
colliculus appears to be responsible for attentive and orientation behaviors. Based on
these neurological facts, two levels of information processing are considered in this
paper.

The first level extracts modes from the global stimulus reaching the acoustic sensor
(i.e. the microphone). The extracted modes constitute stable and
observable states of the acoustic sensorial subsystem, and each can be identified with
one of the sound classes in the vocabulary of an automatic speech recognition
system. The second level processes the modes, which may behave chaotically if the
patterns encountered in the training and testing conditions are very different.
Oscillations and chaos are ubiquitous in the brain. They reveal an indecisiveness
between the input pattern and the stored patterns, and some authors conjecture
that oscillations and chaos may play an important role in the binding and integration
of different information in the brain. Moreover, other findings in the speech
transmission domain have shown that a chaotic system such as the Lorenz system,
defined by a set of differential equations, can provide a self-synchronization ability
to a speech transmitter and a speech receiver, so that their robustness increases [4].
Hence the oscillations and chaos of a Modal Neural Network could be used to
increase the robustness of a speech perceiver.

Several methods have been developed for controlling chaos [5], by which an
unstable orbit can be stabilized in a chaotic system. Recently some authors have used
them to control neural network states by applying an external artificially
generated signal [6].

In this paper, the chaos of the MNN is controlled by an external signal measured in the
real world (i.e. by a visual input). An MNN in a chaotic state means that the stimulus
provided by the acoustic path is not enough to suppress the indecisiveness about the
input sound class. By using the modes extracted from the visual path as control
signals, a stabilization occurs.

In  the  next  section,  we  describe  the  general  features  of  a  MNN.  In  section  3,  the 
network  controller  is  introduced.  In  section  4  our  experiment  protocol  and  results  are 
discussed.  In  section  5,  we  compare  the  results  to  others  and  conclude  the  paper  in 
section  6. 


2.  MODAL  NEURAL  NETWORK 

In  this  study,  we  assume  two  kinds  of  perturbations  which  in  turn  correspond  to  two 
kinds  of  robustness  problems  for  a  given  information  integration  system: 

•  coherent perturbations can be solved by a control theory approach. In this
framework some corrections of the system states can be obtained by designing a
closed loop to drive the modes of the system. For us, this case corresponds to a
testing condition not very different from the training conditions.

•  chaotic perturbations require that the system be able to visit some new stable
states which are very far from the conditions encountered during the training
phase. In other words the system must be "creative" because it reflects the states
of a creative system (the "human talker"). Thus, the study carried out here can be
seen as a complementary view of the robustness problem discussed in [2].

We associate these two perturbations with kind and aggressive robustness
problems respectively. Kind robustness can be achieved by using Kalman filtering
or any other error correction method. Aggressive robustness requires a more
"creative" behavior of the control system, like the chaotic one. For the task considered
in this paper, we model kind robustness with Artificial Neural Networks, which are
considered to be noise-resistant, and specifically with Modal Neural Networks, while
aggressive robustness is modeled by the chaotic control of the MNN modes.

Fig. 1: A system architecture for robust audio-visual speech recognition. Two Modal Neural
Networks (see text) model the information processing subsystems for the visual and acoustic
paths.

An MNN can be any Artificial Neural Network implementation that stores, retrieves
and manipulates stimuli. In the speech domain an MNN for audio-visual speech
perception takes as input a speech signal together with some lip-opening parameters
and outputs the class of the sound: it implements a sound classifier.

Let $I$ represent the input space and $O$ the output space of data recorded from a
system for which we try to find a model. The solution obtained with an Artificial
Neural Network can be understood as the search for a function $f(W, I, E)$ which
associates $I$ to $O$ by encoding information into the weights $W$ of the neural
network. If the network is successfully implemented and trained, then we can consider
that it has computed an inverse matrix $A^{-1}$ so that we have $O = A^{-1} I$. The modes of
this latter matrix store the underlying mapping between $I$ and $O$. Different solutions
exist to solve this problem, e.g. the Hopfield network, a well-known example which
seems plausible in terms of biological implementation. Other
solutions, like feed-forward neural networks, could also be used to obtain the function
$f$ and the different modes.

3.  CONTROLLING  THE  MODES  OF  A  MNN

From the discussion in section 2, we will model a MNN as
a system $S$ having $M$ modes $\lambda_i$, $i = 1, \ldots, M$. Let us
denote m(t) = :

The  problem  is  now  to  control  the  movements  of  the  different  modes  so  that  they  go 
towards  some  desired  targets.  When  the  modes  are  close  to  the  unit  circle  in  the 
Nyquist  plane,  then  there  are  some  oscillations.  When  the  modes  are  real  and  positive 
there  is  instability  and  chaos. 

We  will  use  a  parallel  updating  rule  introduced  in  [3]  and  defined  as: 
m(t)  =  1  -  2^{a^m(t)  +  a2[m(t)f  +  I(t  ))with 

where $I(t)$ represents a control signal and $a_1$, $a_2$ are two weighting parameters.

This rule generates periodic, chaotic and bifurcation behaviors for
different values of the noise level $\sigma$ [6]. The control consists in taking the modes
computed in the visual path as $I(t)$ in order to recover clean acoustic modes from noisy
ones.

Fig. 3: The modes computed on the audio path perturbed by another vowel sound with an SNR
of -20 dB are chaotic (a). Conversely, the visual path has less perturbed modes (b). The
stabilization due to the visual path modes occurs in (d) after a period of indecisiveness in (c).


4.  EXPERIMENTS 

A video recording of a speaker pronouncing the 10 French vowels is the
starting point of the experiment.

The image of the speaker's face is processed by the Snakes method to extract the
lip contours [7]. From these contours we derive lip-opening parameters (the
center of gravity of the mouth opening area, and the vertical and horizontal lip widths).
These three lip parameters are input to the visual MNN.

Similarly, each vowel acoustic signal is processed with the Hermansky PLP method
[8]. The resulting 5-dimensional vector is given to the acoustic MNN.

For  each  vowel  100  pairs  of  audio-visual  patterns  are  obtained  for  the  experiment. 

The experiment concerns the control of the modes computed on the auditory path by
the visual path. The MNNs (acoustic and visual) are two feed-forward neural
networks [9] which are trained to classify the 10 vowels. After training, the MNN
weights are frozen. For each MNN we do not apply the sigmoidal thresholding at the
output units. Thus they have outputs varying between 0.0 and 1.0 corresponding to the
strength of the network modes. For more discussion on the training of these networks
see [9].

The first experiment concerns the audio-visual recognition of the 10 vowels perturbed
by Gaussian noise (see Fig. 2), and the second a perturbation by the vowel /a/
pronounced by a female speaker. Note that this latter case corresponds to a
situation in which two speakers speak at the same time (i.e. the perturbation is the
voice of another speaker). The two experiments are carried out for a Signal to Noise
Ratio (SNR) of -20 dB.

In a quiet environment the acoustic input allows a good computation of the modes of
the 10 vowels, which would lie on the diagonal as in Fig. 2d. When the noise level
increases (see Fig. 2a), the coordinates of the modes behave chaotically because
insufficient information is delivered by the auditory path. This unstable behavior can
then be controlled by the visual path, which brings the complementary information
(Fig. 2c) needed to stabilize the acoustic modes (see Fig. 2d).

If we change the kind of perturbation we observe similar results. However, the effect
of the visual path in the modal space is not as good (Fig. 3c), because the vowel
spectral information is strongly perturbed when the perturbation is itself a vowel
sound. In both cases stabilization was achieved.


5.  DISCUSSION 

Many studies have been carried out on the control of chaos. Among them some
have specifically been applied to speech domains. Cuomo et al. (1993) have shown
the value of using the Lorenz chaotic system for robust transmission between a
transmitter and a receiver. The basic idea is to take advantage of the
self-synchronization feature of a chaotic system, which is able to reach a given
target from any initial condition. When a receiver and a transmitter lose their
synchronization because of some perturbations in the channels, the two chaotic
systems are able to re-synchronize by themselves. The application consists in
considering the speech signal to be transmitted as a low-level perturbation, and this
increases the reliability and robustness of the transmission. In this study this feature is
used to move a MNN toward a nominal performance state.

The control of a Hopfield neural network has been shown in [6]. That work showed the
possibility of using a chaotic system as a two-module information processing system.
The second neural system, identified as the colliculus in the cat's brain, sends a signal
which stabilizes the chaotic behavior of the first neural system, which stores and
recalls patterns. Only simulated data have been tested.

As far as we know, the present study is the first attempt to position
the modes computed from audio-visual data recorded from a human subject.




6.  CONCLUSION 


We described a framework for controlling the chaotic movements of the modes of a
neural network which learns to recognize visual speech. An external signal taken from
another sensorial input (visual) allowed us to stabilize the network. The results
obtained on the 10 French vowels displayed in a modal space and perturbed by
white (Gaussian) noise and by colored noise (another vowel sound) have shown some
strong capabilities of the Modal Neural Network for processing new incoming events
occurring in real world conditions. The complementarity between the auditory and the
visual paths is exploited at the level of modes, whereas the environment is assumed
to be handled by the MNN. The general methodology based on two interacting
information processing modules could be applied to other kinds of sensor integration
problems and related to other studies on multi-modal data analysis.

If the basic structure of the MNN is a Hopfield network then our study may be
compared to some biologically plausible implementations of information processing in
the brain. Nevertheless, the methodology developed could be taken as a contribution
to the modeling of robust information processing for real world applications.

ABSTRACT

A demonstration program has recently been completed to apply
various artificial intelligence techniques, including neural
networks, expert systems, and case-based reasoning, to fault
detection in satellite communications systems. The GMM
program  implemented  these  techniques  for  Global  Military 
Satellite  Communications  Maintenance.  Neural  networks  were 
designed  and  trained  to  analyze  incoming  Built-In-Test  (BIT) 
fault  signatures  from  the  satellite  communications  terminal. 
Expert  systems  were  developed  to  embed  diagnostic  knowledge 
relating  to  equipment  maintenance.  The  prototype  hybrid 
system  uses  neural  filters  to  detect  faults,  which  are  further 
processed  by  expert  systems  to  classify  the  faults  and  provide 
repair  directions. 


INTRODUCTION 

Present  military  communications  systems  perform  Built-In-Test  (BIT)  false 
alarm  reduction  and  fault  isolation  through  largely  heuristic  algorithms  such  as 
n-of-m  filtering  and  deterministic  isolation  trees.  These  methods  are  often 
implemented  in  a  combination  of  system  processing  (typically  software)  and 
manual  procedures  contained  in  maintenance  manuals.  Furthermore,  methods 
of  collecting  BIT  history  often  rely  on  the  maintainer  to  fill  out  paperwork,  with 
no  systematic  feedback. 
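
As an illustration of the n-of-m filtering mentioned above, the following minimal Python sketch raises an alarm only when at least n of the last m BIT samples indicate a fault. The function name `n_of_m_filter`, the sample sequence and the values n = 3, m = 4 are assumptions made for illustration; they are not taken from the fielded systems.

```python
from collections import deque


def n_of_m_filter(bit_flags, n, m):
    """Return one filtered alarm per raw BIT flag.

    An alarm is raised at a given step only if at least n of the
    last m raw flags (including the current one) were set.
    """
    if not 1 <= n <= m:
        raise ValueError("require 1 <= n <= m")
    window = deque(maxlen=m)  # sliding window over the most recent m flags
    alarms = []
    for flag in bit_flags:
        window.append(bool(flag))
        alarms.append(sum(window) >= n)
    return alarms


# Hypothetical raw BIT history: an isolated glitch followed by a persistent fault.
raw = [0, 1, 0, 0, 1, 1, 1, 0, 0, 0]
filtered = n_of_m_filter(raw, n=3, m=4)
print(filtered)
```

The isolated flag at the second sample is suppressed, while the run of three consecutive flags produces an alarm. Such a filter trades detection delay for fewer false alarms, which is the heuristic behavior the GMM approach aims to improve upon.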

Once fielded, BIT performance of these systems has proven difficult to
improve, both because upgrades are difficult and because of the lack of
centralized analysis and machine learning. Consequently, excessive unnecessary
repair actions and the need for skilled maintainers keep maintenance costs high
over the life of the system.


GMM  AI-BASED  APPROACH  TO  BIT 

The purpose of the Global MILSATCOM Maintenance (GMM) program was to
apply Artificial Intelligence and Neural Network methods to address the lack of
automated, systematic improvement of BIT performance, especially in the
post-fielding environment.

In  the  GMM  concept,  terminals  process  BIT  fault  signatures  using  AI 
techniques.  Whenever  a  maintenance  action  occurs,  history  is  automatically 
transmitted  to  a  central  analysis  facility  using  satellite  communication 
resources  on  an  as-available  basis.  At  the  central  facility,  a  diagnostic/system 
analyst  uses  AI  tools  to  analyze  the  accumulated  history  and,  if  appropriate, 
generate  updates  to  the  AI  processing  in  the  fielded  terminals.  Hence,  each 
fielded  system  benefits  from  the  cumulative  experience  of  all  fielded  systems. 

GMM  has  the  potential  to  significantly  reduce  fielded  system  maintenance 
costs.  The  savings  can  arise  from  improved  fault  diagnosis  which  allows 
operators  to  assume  field  maintenance  duties,  thereby  eliminating  skilled  field 
maintainers  and  related  training  and  personnel  costs,  and  reducing  unnecessary 
repair  actions. 

The  goal  of  the  GMM  program  was  to  demonstrate  the  GMM  concept  by 
developing  and  integrating  prototype  components  employing  the  AI  techniques 
of  neural  networks,  expert  systems  and  case-based  reasoning. 

Figure  1  graphically  depicts  the  system  concept.  In  this  concept,  the  fielded 
terminals  may  have  neural  network  filters  at  several  levels  in  the  equipment 
hierarchy:  module,  assembly,  subsystem  and  system.  These  neural  filters  may 
receive  corroborating  fault  signature  data  from  BIT  hardware  and  other  neural 
filters  and  also  environmental  information  such  as  temperature  or  vibration. 
They  use  this  data  to  make  local  low-level  fault  classifications.  The  results  of 
the  neural  filters  may  be  used  by  an  expert  system  to  make  higher-level  fault 
classifications  and  give  repair  directions. 

[Figure 1 box contents. First group: BIT; Neural Network(s); Expert System Classifier; Repair Documentation. Second group: Data Base of Fault Histories; NN Training/Testing Data; Expert Logic; Tools to Detect Trends and Anomalies. Third group: Neural Network Updates; Expert System Classifier Updates.]

Figure 1. GMM System Concept


Reports  on  BIT  fault  indications  are  sent  to  a  central  facility  and  stored.  Fault 
signatures,  environmental  information,  repair  actions  and  fault  isolation 
decisions are also sent, using the communication system on a low priority
"as-available" basis. In the conceptual GMM system, the central site has a storage
capability  for  fault  history  data  from  the  terminals.  An  analyst  reviews  the  data 
to  determine  if  the  neural  networks  or  the  expert  system  logic  are  incorrect. 
The  fault  isolation  logic  can  be  modified  and  new  test  cases  can  be  added  to  the 
training  data  library  of  the  neural  networks.  Periodically  the  fault  isolation 
logic  and  the  neural  networks  can  be  updated  and  distributed. 

The use of a centralized AI-based analysis facility with automated,
user-transparent access to fault and repair data from a large community of fielded
systems which themselves employ AI techniques is the key to effectively
introducing AI and learning techniques at the organizational maintenance
level.

In the GMM concept, neural network technology was selected for use at
terminal  field  sites  to  refine  and  improve  the  results  of  BIT  fault  reporting  and 
subsequent  fault  detection.  The  results  of  this  development  are  described  in 
this  paper. 

Expert  system  technology  was  selected  to  improve  the  knowledge  acquisition 
process  as  well  as  to  provide  the  capability  to  generate  executable  fault  isolation 
logic.  In  concept,  the  diagnostic  system  will  consist  of  two  tools,  one  for 
diagnostic  fault  logic  development  and  one  for  execution  of  that  logic. 

The  technology  of  case-based  reasoning  (CBR)  was  selected  for  use  in 
maintaining  and  optimizing  parts  of  the  system  which  can  be  more  costly  to 
maintain  traditionally,  such  as  the  expert  system.  In  addition,  CBR  embodies 
an  inherent  mechanism  for  system  adaptation  and  learning.  The  case-based 
reasoning  function  examines  the  failure  events,  compares  them  to  known  cases 
in  the  library,  and  recognizes  deficiencies.  Also,  since  the  incoming  failure 
events  may  not  exactly  match  the  existing  cases  in  the  library,  the  case-based 
reasoning  function  will  incorporate  new  experiences  into  the  library  by 
adapting  existing  cases  with  information  from  the  new  event  to  make  new 
cases.  The  case  library  will  grow  with  new  information,  which  can  then  be 
used  by  the  system  analyst  to  determine  where  and  when  to  update  the  other 
functions  of  the  system.  The  global  position  of  the  case  library  and  case-based 
reasoning  function  will  allow  the  distributed  knowledge  of  the  system  to  be 
collectively  examined,  refined,  and  returned  to  all  sites  consistently.  After  the 
concept  was  defined,  funding  cuts  did  not  permit  this  CBR  component  to  be 
implemented  in  the  prototype. 


NEURAL  NETWORK  DEVELOPMENT 

We selected a representative terminal system with BIT based on traditional
technology. The system comprised a rich set of replaceable units. These
replaceable units are referred to in this paper as generic components labeled
Units A-E, as depicted in Figure 2.

Figure 2: Representative Terminal System

The  fault  signatures  available  can  be  represented  in  the  following  ambiguity 
groups: 


•  a  subcomponent  of  Unit  B 

•  a  subcomponent  of  Unit  C 

•  the  cable  between  Units  A  and  B 

•  the  cable  between  Units  B  and  C 

•  the  cable  between  Units  C  and  D 

•  the  cable  between  Units  C  and  E 

•  the  waveguide  between  Units  C  and  E 

•  a  subcomponent  of  either  Unit  B  or  Unit  C,  but  not  the  cable 
between  Units  B  and  C 


The  BIT  signatures  for  the  Unit  B/Unit  C  have  96  bits  in  their  raw  form. 
Examination  of  the  Unit  B/Unit  C  fault  logic  trees  revealed  that  the  terminal 
isolation  processing  uses  less  than  half  of  the  fault  bits.  Those  bits  that  did  not 
contribute  to  a  decision  (i.e.,  designator  bits,  RF  switch  status,  etc.)  were 
removed  from  consideration,  leaving  39  relevant  bits. 

Next,  a  matrix  which  we  called  the  Base  Vector  Table  (BVT)  was  constructed 
with  a  column  for  each  of  the  39  bits  and  a  row  corresponding  to  each 
indictment  as  defined  by  each  distinct  solution  node  of  the  fault  logic  trees. 
The  BVT  provided  a  tabular  representation  of  the  terminal’s  fault  isolation 
processing  for  the  Unit  B  and  Unit  C. 
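
The construction of the Base Vector Table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the positions of the 39 relevant bits, the three example rows and the helper names `reduce_signature` and `lookup_indictment` are hypothetical, since the paper does not list the actual bit assignments.

```python
import random

RAW_BITS = 96

# Hypothetical choice of which 39 of the 96 raw bits feed the fault logic trees.
_rng = random.Random(0)
RELEVANT_BITS = sorted(_rng.sample(range(RAW_BITS), 39))


def reduce_signature(raw_signature):
    """Drop the bits that do not contribute to a decision."""
    if len(raw_signature) != RAW_BITS:
        raise ValueError("expected a 96-bit signature")
    return tuple(raw_signature[i] for i in RELEVANT_BITS)


# Hypothetical BVT: one row per indictment, one column per relevant bit.
bvt = {
    "No Fault": tuple([0] * 39),
    "Unit B": tuple([1] + [0] * 38),
    "Unit B/cable/Unit C": tuple([1, 1] + [0] * 37),
}


def lookup_indictment(raw_signature):
    """Return the indictment whose BVT row matches the reduced signature, if any."""
    reduced = reduce_signature(raw_signature)
    for indictment, row in bvt.items():
        if row == reduced:
            return indictment
    return None


# A raw signature with the first relevant bit set and one irrelevant bit set.
raw = [0] * RAW_BITS
raw[RELEVANT_BITS[0]] = 1
ignored_bit = next(i for i in range(RAW_BITS) if i not in RELEVANT_BITS)
raw[ignored_bit] = 1
print(lookup_indictment(raw))
```

The irrelevant bit has no effect on the lookup, which mirrors the removal of designator and status bits described above. A table of this kind can also serve as a seed for synthesizing training vectors, one per row.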

After the input data sources were identified and examined, a set of network
output classifications was defined:

1. Unit A/cable/Unit B

2. Unit B

3. Unit B/cable/Unit C

4. Unit B/Unit C

5. Unit D/cable/Unit C

6. Unit C

7. waveguide/Unit E/Unit C

8. Unit E/cable/Unit C

9. No Fault

From the set of 9 output classes, two others were derived. The first consisted of
4 classes and grouped all Unit B indictments, all Unit C indictments and all
Unit B/Unit C indictments together:

1. Unit B (from faults 1 and 2 above)

2.  Unit  C  (from  faults  5,  6,  7,  and  8  above) 

3.  Unit  B/Unit  C  (from  faults  3  and  4  above) 

4.  No  Fault  (from  9  above) 

The  second  consisted  of  a  binary  "yes/no"  classification,  based  on  the  given 
unit.  This  required  two  binary  classifiers,  one  for  the  Unit  B  and  one  for  the 
Unit  C.  The  binary  Unit  B  classifier  had  the  following  classes: 

1.  Unit  B  (from  faults  1,  2,  3,  and  4  above) 

2.  Not  Unit  B  (from  faults  5,  6,  7,  and  8  above) 

The binary Unit C classifier had the following classes:

1.  Unit  C  (from  faults  3  through  8  above) 

2.  Not  Unit  C  (from  faults  1,  2,  and  9  above) 
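
The derivation of the 4-class and binary label sets from the 9-class set amounts to a relabeling, which the following Python sketch makes explicit. The dictionary and function names are illustrative assumptions; the groupings follow the lists above, and fault 9 is left unmapped for the binary Unit B classifier because the text does not assign it to either class.

```python
# 9-class labels, numbered as in the text.
NINE_CLASS = {
    1: "Unit A/cable/Unit B",
    2: "Unit B",
    3: "Unit B/cable/Unit C",
    4: "Unit B/Unit C",
    5: "Unit D/cable/Unit C",
    6: "Unit C",
    7: "waveguide/Unit E/Unit C",
    8: "Unit E/cable/Unit C",
    9: "No Fault",
}

# 4-class grouping.
FOUR_CLASS = {
    1: "Unit B", 2: "Unit B",
    3: "Unit B/Unit C", 4: "Unit B/Unit C",
    5: "Unit C", 6: "Unit C", 7: "Unit C", 8: "Unit C",
    9: "No Fault",
}

# Binary groupings; fault 9 is not listed for the Unit B classifier in the text.
BINARY_UNIT_B = {**{f: "Unit B" for f in (1, 2, 3, 4)},
                 **{f: "Not Unit B" for f in (5, 6, 7, 8)}}
BINARY_UNIT_C = {**{f: "Unit C" for f in range(3, 9)},
                 **{f: "Not Unit C" for f in (1, 2, 9)}}


def relabel(fault, mapping):
    """Map a 9-class fault number to a derived class, or None if unmapped."""
    return mapping.get(fault)


for fault in sorted(NINE_CLASS):
    print(fault, relabel(fault, FOUR_CLASS),
          relabel(fault, BINARY_UNIT_B), relabel(fault, BINARY_UNIT_C))
```

Keeping the groupings in lookup tables makes it straightforward to regenerate training targets for each network from a single labeled data set.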

The  motivation  for  the  binary  network  architecture  was  to  configure  the 
networks  as  binary  decisions  with  an  overall  classification  scheme  resembling  a 
binary  decision  tree.  By  structuring  the  individual  neural  networks  as  binary 
decisions  with  one  output  node,  the  network  can  effectively  be  “tuned”  to  detect 
the desired classification. Because each network then acts as a detector,
well-established statistical methods can be applied to better understand the data
and the network. These methods are discussed in detail in Appendix A, which
describes the Multi-Layer Perceptron (MLP) design algorithm [1].
This algorithm uses information from the training data set to structure the
network,  including  calculation  of  the  necessary  number  of  hidden  nodes  to 
execute  the  mapping,  and  computation  of  "good"  starting  weights.  The 
algorithm  provides  a  means  of  designing  portions  of  the  architecture  of  the 
neural  networks  in  advance.  It  can  also  reduce  training  time,  improve  the 
convergence  rate,  increase  the  likelihood  of  convergence  at  a  global  minimum, 
and  improve  the  generalization  performance  on  novel  data. 

NEURAL  NETWORK  RESULTS 

The  failure  data  collected  during  factory  testing  demonstrated  the  potential  of 
pre-deployment  "field"  data  to  provide  early  feedback  to  a  central  facility  in  a 
GMM  system,  or  to  system  engineers  in  a  more  traditional  design  environment. 
However,  to  be  effective,  all  opportunities  to  collect  BIT  signature  data  should 
be  exploited.  In  addition,  a  rigorous  and  preferably  automated  method  of 
linking  repair  actions  to  failures  and  BIT  signatures  must  be  in  place. 

A  summary  of  the  training/testing  results  for  all  networks  is  given  in  Table  1. 
The  fact  that  the  training  and  testing  results  were  similar  indicated  that  the 
networks  were  able  to  generalize  and  were  not  "memorizing"  the  training  data. 
The  results  indicated  that  the  networks  could  distinguish  well  among  the 
different  classes. 


Network                      % Correct        % Correct        % Correct
                             Synthesized      Synthesized      Factory Data
                             Training Data    Testing Data

9-Class Heuristic            100.0            98.4             100.0
4-Class Heuristic            99.4             99.0             96.4
Binary Unit B Statistical    100.0            95.9             96.4
Binary Unit C Statistical    99.5             91.5             93.0

Table 1. Results of Network Training and Testing

ADVANTAGES OF THE GMM APPROACH

The  major  advantage  of  the  GMM  approach  to  maintenance  is  the  automated, 
closed-loop  nature  of  the  data  collection,  analysis,  and  feedback  path.  This 
allows  each  fielded  system  to  benefit  from  the  cumulative  experience  of  all 
fielded  systems. 

Fault  information  is  processed  using  a  variety  of  AI  techniques,  each  selected 
for  its  particular  suitability  to  the  task  it  performs.  Whenever  a  maintenance 
action  occurs  at  a  given  field  site,  history  is  automatically  transmitted  to  a 
central  analysis  facility  where  a  diagnostician  or  system  analyst  uses  AI  tools 
to  analyze  the  accumulated  history  and,  if  appropriate,  generate  updates  to  the 
AI  processing  in  all  the  fielded  systems.  In  addition,  the  automated  nature  of 
the  process  provides  a  much-needed  consistency  in  the  methods  of  making  and 
installing  modifications  to  the  diagnostic  tools  and  process. 

The  capability  of  evolutionary  improvement  was  experienced  first-hand  by  the 
GMM  neural  network  developers.  During  network  training  it  was  discovered 
that  the  "No  Fault"  data  class  had  not  been  included  in  the  training  data  for  the 
statistical  networks.  As  a  result,  their  classification  results  were  initially  poor. 
Examples  of  "No  Fault"  data  were  added  to  the  training  data  set  and  the 
networks  were  retrained  and  tested.  As  expected,  a  second  iteration  of  testing 
against  the  factory  data  yielded  much  improved  results  as  shown  in  Table  2. 
This  is  an  example  of  the  GMM  concept  in  practice. 


Network                      % Correct Factory Data    % Correct Factory Data
                             Without "No Fault"        With "No Fault"

Binary Unit B Statistical    75.0                      96.4
Binary Unit C Statistical    64.0                      93.0

Table 2: Improved Results by Adding "No Fault" Class

SUMMARY


The major conclusions of the project were as follows. The study of the neural
networks indicated that the backpropagation network model has potential for
application to fault location. The expert system tools KAS and LTS were found to be
well-suited for inclusion in a GMM system. Additional work is needed to prototype
an  implementation  of  the  Central  Analysis  Facility  in  order  to  confirm  the 
effectiveness  of  the  overall  GMM  concept.  Enhancements  to  the  GMM  concept 
were  identified  that  would  broaden  its  scope  beyond  fielded  systems  to  include 
depot  facilities,  an  automated  hardware  component  tracking  system,  and  a 
maintainer  training  system. 

The  work  described  in  this  paper  was  performed  by  Raytheon  Company  under 
the  Global  MILSATCOM  Maintenance  (GMM)  contract  for  the  Defense 
Advanced  Research  Projects  Agency  (DARPA)  with  U.  S.  Army  CECOM,  PM 
SATCOM  as  Executive  Agent  [3].  It  expanded  upon  work  performed  under  a 
previous  contract,  Neural  Network  False  Alarm  Filtering,  which  was  performed 
by  Raytheon  Company  for  Rome  Laboratory  Air  Force  Materiel  Command  [4]. 

Abstract

This paper proposes a multi-linguistic handwritten character recognition
system based on Bayesian decision-based neural networks (BDNN).
The proposed system consists of two modules. First, a coarse classifier
assigns an input character to one of the pre-defined subclasses
partitioned from a large character set, such as Chinese mixed
with alphanumerics. Then a character recognizer matches the
input image to its best-matching reference character in the subclass.

The proposed BDNN can be effectively applied to implement all these
modules. It adopts a hierarchical network structure with nonlinear
basis functions and a competitive credit-assignment scheme. Our
prototype system demonstrates a successful application of BDNN to
handwritten Chinese and alphanumeric character recognition on
both public databases (HCCR/CCL for Chinese and CEDAR for
alphanumerics) and an in-house database (NCTU/NNL). Regarding
performance, experiments on three different databases all demonstrated
high recognition accuracies (88~92%) as well as low
rejection/acceptance rates (6.7%), as elaborated in Section 3.2. As to
processing speed, the whole recognition process (including image
preprocessing, feature extraction, and recognition) takes approximately
0.27 second/character on a Pentium-90 based personal computer,
without using a hardware accelerator or co-processor.


1  Introduction 

Machine recognition of characters has been a topic of intense research since
the 1960s [4]. During the last decade, more and more commercial products became
available in the market. On the other hand, due to the enormous number of
variations involved, handwriting recognition applications still require more
work before they can reach performance comparable to that of a human. Generally
speaking, current handwriting recognition techniques and problems can be
categorized in two ways: (1) by character input method
(on-line or off-line), and (2) by character language.

On-line and off-line: For the on-line approach, a computer receives
trajectory coordinates by sampling the writing trace from a pen-based panel or
a tablet. For the off-line approach, the whole text image is scanned into a
computer and stored in digital image format. Because the scanned image carries
no trajectory information, off-line character recognition is usually more
difficult than on-line recognition. Nevertheless, both on-line and off-line
recognition techniques have their unique technical problems.

*This research was supported in part by the National Science Council under Grant NSC
85-2213-E009-125.

1 The corresponding author.

Language issues: It is well known that Chinese characters (including Kanji
in Japanese) differ from the characters of Western languages in that
they are non-alphabetic and have quite complicated stroke structures. Taking
modern Chinese as an example, there are more than 5,000 commonly
used characters, and a word may consist of one or several characters. The
large character set of a language usually causes degradation in recognition
accuracy and speed.

In addition, directly applying current mono-language character recognition
techniques [2] to multi-linguistic documents will be quite difficult,
because:

1. separating mixed characters of different languages efficiently and
correctly is not trivial, and is sometimes just as hard as recognizing the
characters,

2. implementing two or more different types of recognition modules is not
time and space (both software and hardware) efficient,

3. combining recognition results from two different types of recognition
modules adds unnecessary extra work.

Therefore, it is desirable to design a uniform recognition architecture for
multi-linguistic character recognition. First, select a set of proper
character features for characters of different languages in a document, such that
they can be represented by a uniform feature vector. Then, a character
recognition architecture for a large character set can be adopted directly
(or with just minor modification) for multi-linguistic character recognition.
Compared with a large character set such as Chinese, alphanumerics can be
considered a small set of special characters added to the larger Chinese character
set. Thus, uniform feature selection and a uniform recognition architecture can be
applied.

It is well known in statistical pattern recognition [1] that the Bayesian
decision rule can be implemented as an optimal image pattern classifier. By
applying the statistical features of a character image, the Bayesian rule can
be used to match an input character to a reference character with minimal
classification error. A Bayesian Decision-based Neural Network (BDNN) has
the merits of both neural networks and statistical approaches. For example,
Lin and Kung [3] proposed a Probabilistic Decision-Based Neural Network
(PDBNN) for the implementation of a face recognition/detection system.
More specifically, the modularity of the BDNN makes it suitable
not only for mono-language but also for multi-linguistic
character recognition. Therefore, we propose the BDNN to attack
the multi-linguistic character recognition problem. The organization of this
paper is as follows. In Section 2, the mathematical background and the
architecture of the BDNN are presented. Then, an overview of the total system for
handwriting recognition is presented in Section 3. The system consists of two
modules, both implemented by the BDNN. The two modules, a coarse
classifier and a character recognizer, are discussed in detail. Experimental
results for these modules are provided in both sections.

2  Bayesian  Decision-based  Neural  Network 

Bayesian Decision-based Neural Network (BDNN) is a Bayesian rule based
modular neural network for classification. One subnet of a BDNN is designed
to represent one object class. Suppose there are $K$ categories $\omega_1, \dots, \omega_K$ in the
feature space of a classifier or a cluster. The Bayesian decision rule classifies
input patterns based on their posterior probabilities: an input character $x$
is classified to category $\omega_i$ if $P(\omega_i \mid x) > P(\omega_j \mid x)$ for all $j \neq i$. Suppose the
likelihood density of input pattern $x$ given category $\omega_i$ is a $D$-dimensional
Gaussian distribution. The posterior probability $P(\omega_i \mid x)$ by Bayes rule is

$$P(\omega_i \mid x) = \frac{P(\omega_i)\, p(x \mid \omega_i)}{p(x)}, \qquad p(x \mid \omega_i) \sim N(\mu_i, \Sigma_i),$$

where $P(\omega_i)$ is the prior probability of category $\omega_i$ ($\sum_{i=1}^{K} P(\omega_i) = 1$), and
$p(x) = \sum_{k=1}^{K} P(\omega_k)\, p(x \mid \omega_k)$.

The category likelihood function $p(x \mid \omega_i)$ can be extended to a mixture
of Gaussian distributions. Define $p(x \mid \omega_i, \Theta_{r_i})$ to be one of the Gaussian
distributions which make up $p(x \mid \omega_i)$:

$$p(x \mid \omega_i) = \sum_{r_i=1}^{R_i} P(\Theta_{r_i} \mid \omega_i)\, p(x \mid \omega_i, \Theta_{r_i}),$$

where $P(\Theta_{r_i} \mid \omega_i)$ is the prior probability of cluster $r_i$, and
$\Theta_{r_i} = \{\mu_{r_i}, \Sigma_{r_i}\}$ is the parameter set for the cluster $r_i$, when input character
patterns are from category $\omega_i$. By definition, $\sum_{r_i=1}^{R_i} P(\Theta_{r_i} \mid \omega_i) = 1$. In
the most general formulation, the basis function of a cluster should be able to
approximate the Gaussian distribution with full rank covariance matrix, that is,

$$p(x \mid \omega_i, \Theta_{r_i}) = \frac{1}{(2\pi)^{D/2} |\Sigma_{r_i}|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu_{r_i})^T \Sigma_{r_i}^{-1} (x - \mu_{r_i})\right\} \qquad (1)$$

where $\Sigma_{r_i}$ is the covariance matrix. However, for those applications which
deal with high dimension data but a finite number of training patterns, the
training performance and storage space discourage such matrix modeling. A
natural simplifying assumption is to assume uncorrelated features of unequal
importance. Suppose that $p(x \mid \omega_i, \Theta_{r_i})$ is a $D$-dimensional Gaussian
distribution with uncorrelated features, that $\mu_{r_i} = [\mu_{r_i 1}, \dots, \mu_{r_i D}]^T$ is the
mean vector, and that the diagonal matrix $\Sigma_{r_i} = \mathrm{diag}[\sigma_{r_i 1}^2, \dots, \sigma_{r_i D}^2]$ is the
covariance matrix. As shown in Figure 1, a BDNN contains $K$ subnets, which
are used to represent a $K$-category classification problem. Inside each
subnet, an elliptic basis function (EBF) is used to serve as the discriminant
function for each cluster $r_i$:

$$\phi(x, \omega_i, \Theta_{r_i}) = -\frac{1}{2} \sum_{d=1}^{D} \alpha_{r_i d} (x_d - \mu_{r_i d})^2 + \theta_{r_i}.$$

If $\theta_{r_i}$ is set to $\theta_{r_i} = -\frac{D}{2} \ln 2\pi + \frac{1}{2} \sum_{d=1}^{D} \ln \alpha_{r_i d}$, then $\exp\{\phi(x, \omega_i, \Theta_{r_i})\}$ can
be viewed as the Gaussian distribution, as described in Eqn. (1), except for
a minor notational change: $1/\alpha_{r_i d} = \sigma_{r_i d}^2$.
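
The correspondence between the EBF discriminant and a Gaussian density with diagonal covariance can be checked numerically. The Python sketch below is an illustration only: the function names and the test values are assumptions, and it covers a single cluster rather than a full BDNN subnet.

```python
import math


def ebf_discriminant(x, mu, alpha):
    """EBF discriminant with the threshold chosen so that exp(phi) is a Gaussian density.

    alpha[d] plays the role of 1 / sigma_d^2 for feature d.
    """
    D = len(x)
    theta = -0.5 * D * math.log(2.0 * math.pi) + 0.5 * sum(math.log(a) for a in alpha)
    quad = sum(a * (xd - md) ** 2 for xd, md, a in zip(x, mu, alpha))
    return -0.5 * quad + theta


def diag_gaussian_pdf(x, mu, var):
    """Density of a Gaussian with uncorrelated features (diagonal covariance)."""
    p = 1.0
    for xd, md, v in zip(x, mu, var):
        p *= math.exp(-0.5 * (xd - md) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
    return p


# Hypothetical two-dimensional example.
x = [1.0, 2.0]
mu = [0.5, 1.5]
var = [0.25, 4.0]
alpha = [1.0 / v for v in var]

phi = ebf_discriminant(x, mu, alpha)
print(math.exp(phi), diag_gaussian_pdf(x, mu, var))
```

Working with the discriminant in the log domain avoids numerical underflow for high-dimensional features, which is one practical reason to prefer it over evaluating the density directly.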

2.1  Learning  Rules  for  BDNN 

The training scheme for BDNN contains the following three phases: the unsupervised learning phase, the supervised learning phase, and the self-growing phase.

Unsupervised Learning Phase: The first phase of BDNN training is unsupervised learning. The values of the parameters in the network are initialized in this phase. Many unsupervised clustering algorithms, such as K-means or vector quantization methods, can be applied here.

Supervised Learning Phase: In supervised learning, teacher information is used to fine-tune the decision boundaries. When a training pattern is misclassified, the reinforced or antireinforced learning technique is applied:

Reinforced Learning: $w^{(m+1)} = w^{(m)} + \eta \nabla \phi(x, w)$

Antireinforced Learning: $w^{(m+1)} = w^{(m)} - \eta \nabla \phi(x, w)$ (2)

Threshold Updating: The threshold value of the BDNN recognizer can also be learned by the reinforced or antireinforced learning rules.
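The reinforced and antireinforced rules can be illustrated for the cluster centres of the EBF. This is a hedged sketch only: it assumes, following the usual decision-based learning convention and not anything stated explicitly here, that the receptor of the correct class is reinforced and the wrongly winning receptor is antireinforced. It updates only the centres, using $\partial\phi/\partial\mu_d = \alpha_d (x_d - \mu_d)$. The names `grad_mu` and `update_centres` and the step size are hypothetical.

```python
def grad_mu(x, mu, alpha):
    """Gradient of the EBF phi with respect to the centre mu."""
    return [a * (xd - md) for a, xd, md in zip(alpha, x, mu)]

def update_centres(x, mu_correct, alpha_correct, mu_wrong, alpha_wrong, eta=0.05):
    """Apply one reinforced / antireinforced step after a misclassification.

    mu_correct: centre of the receptor that should have won (reinforced).
    mu_wrong:   centre of the receptor that won by mistake (antireinforced).
    """
    g_c = grad_mu(x, mu_correct, alpha_correct)
    g_w = grad_mu(x, mu_wrong, alpha_wrong)
    new_correct = [m + eta * g for m, g in zip(mu_correct, g_c)]  # moves toward x
    new_wrong = [m - eta * g for m, g in zip(mu_wrong, g_w)]      # moves away from x
    return new_correct, new_wrong
```

Under these assumptions the reinforced centre moves toward the misclassified pattern and the antireinforced centre moves away from it, which raises one discriminant value and lowers the other.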

Self-growing Phase: One of the difficulties in unsupervised learning for BDNN is deciding on the proper number of clusters for the K-means algorithm. Therefore, we propose the self-growing of clusters (called receptors in the following) during the supervised learning phase. There are three main issues for the self-growing rules:

(I1) When should a new receptor be created?

(I2) Which receptor should be partitioned to create a new receptor?

(I3) How should the centre and the covariance of the newly created receptor be initialized?

On Issue I1, a new receptor is needed when the training set has been presented again and again, yet the training status (especially the recognition rate) stays unchanged or improves very slowly. In other words, the current BDNN cannot properly learn and represent the whole training data set. Therefore, extra Bayesian-type receptors are needed to improve the modeling power of the current BDNN.

On Issue I2, when an extra receptor is needed to improve the training performance, we propose to create the new receptor from the receptor that produced the most misclassifications during recent training.

On Issue I3, when a new receptor is created, the initial values of its centre and covariance matrix need to be properly determined; otherwise the classification capability may not improve efficiently. Suppose that a character pattern $x$ corresponding to cluster $\Theta_i$ is presented to a BDNN classifier, and consider two receptors: the receptor $S_i$, which corresponds to cluster $\Theta_i$, and the receptor $S_j$, which produces the largest response among the clusters other than $\Theta_i$. Let $o_i$ and $o_j$ be the outputs of receptor $S_i$ and receptor $S_j$, respectively, and define the ratio of these two outputs with respect to the training pattern $x$ as $\rho_x = o_j / o_i$. According to the retrieving scheme of the proposed BDNN, if $o_j$ is larger than or equal to $o_i$ (i.e., $\rho_x \ge 1$), then the retrieving result for input pattern $x$ must be wrong. The smaller $\rho_x$ is, the better the classification performance. Therefore, it is better to initialize the new receptor with parameters ($\mu_i$, $\Sigma_i$) such that $\rho_x$ is small ($< 1$) with respect to all the other receptors $S_j$. Since the new receptor generates its maximal output for the input pattern $x$ when its centre is located at $x$, we set the initial centre to $\mu_0 = x$. To determine the initial covariance $\Sigma_0$, we let $\Sigma_0 = \sigma I$, where $\sigma$ is a positive constant to be determined. Suppose that the $\sigma$ of the new receptor $S_i$ is not properly determined, as shown in Figure 3(a); then the largest possible output of the created receptor, $o_i(\mu_i)$, is smaller than the output $o_j(\mu_i)$ of an existing receptor $S_j$. Consequently, the new receptor $S_i$ cannot reduce the value of $\rho_x$. We say that the receptor $S_i$ is "overwhelmed" by receptor $S_j$. Figure 3(b) shows a properly initialized new receptor $S_i$ that avoids the overwhelming problem. Therefore, the following two constraints are suggested.


$$o_i(x) = \frac{1}{(2\pi\sigma)^{D/2}} > o_j(x), \qquad o_i(\mu_j) = \frac{1}{(2\pi\sigma)^{D/2}} \exp\left( -\frac{\|\mu_j - x\|^2}{2\sigma} \right) < o_j(\mu_j) \qquad (4)$$

These two constraints imply that receptor $S_i$ and receptor $S_j$ will not overwhelm each other. To satisfy Eq. (4), $\sigma$ should be less than $\frac{1}{2\pi\, o_j(x)^{2/D}}$. Thus, by using $\frac{1}{2\pi\, o_j(x)^{2/D}}$ as an initial value, $\sigma$ can be iteratively decreased by a small value $\eta$ ($0 < \eta < 1$) until Eq. (4) is satisfied. Then, the final value of $\sigma$ can be used as the initial value $\sigma_0$ for the new receptor $S_i$.
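The search for the initial $\sigma$ can be sketched as a short loop. This is an illustrative sketch under stated assumptions, not the authors' code: the new receptor is taken to be an isotropic Gaussian with covariance $\sigma I$ centred at $x$, the two constraints are read as strict inequalities, and "decreased by a small value $\eta$" is interpreted as multiplying $\sigma$ by $\eta$ (the text does not say whether the decrease is multiplicative or subtractive). The names `new_receptor_output` and `init_sigma` are hypothetical.

```python
import math

def new_receptor_output(z, centre, sigma):
    """Isotropic Gaussian with covariance sigma * I, evaluated at z."""
    D = len(z)
    sq = sum((a - b) ** 2 for a, b in zip(z, centre))
    return math.exp(-sq / (2.0 * sigma)) / (2.0 * math.pi * sigma) ** (D / 2.0)

def init_sigma(x, mu_j, o_j_at_x, o_j_at_mu_j, eta=0.9, max_iter=1000):
    """Shrink sigma from its upper bound until neither receptor overwhelms the other.

    o_j_at_x:    output of the existing receptor S_j at the pattern x.
    o_j_at_mu_j: output of S_j at its own centre mu_j.
    """
    D = len(x)
    sigma = 1.0 / (2.0 * math.pi * o_j_at_x ** (2.0 / D))  # bound from the first constraint
    for _ in range(max_iter):
        wins_at_x = new_receptor_output(x, x, sigma) > o_j_at_x
        loses_at_mu_j = new_receptor_output(mu_j, x, sigma) < o_j_at_mu_j
        if wins_at_x and loses_at_mu_j:
            return sigma
        sigma *= eta  # assumed multiplicative decrease
    raise RuntimeError("no sigma satisfying both constraints was found")
```

The returned value would then serve as $\sigma_0$ for the new receptor. The iteration cap is a safeguard added for the sketch.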
3 BDNN Handwritten Character Recognition System

A BDNN-based handwritten character recognition system is being developed in the Neural Networks Laboratory of National Chiao-Tung University. The total system configuration is depicted in Figure 2. All three main modules, namely the pre-processing and feature extraction module, the coarse classifier, and the character recognizer and post-recognition module, are implemented on a Pentium-90 based personal computer. A 300 dpi scanner is used to acquire document images, or page images. The acquired binary images are then downsized to 150 dpi for subsequent processing.

3.1  Image  Pre-processing  and  Feature  Extraction 

Image pre-processing for multi-linguistic character recognition is no different from that for mono-language character recognition. The binary images of a handwritten character are first passed through a series of image processing stages, such as boundary smoothing, noise removal, space normalization, and stroke thinning. After evaluating most of the well-known features [2], we selected features with a high index value, such as crossing count (CCT), belt shape pixel number (BSPN), and stroke orientation feature (STKO), as candidate features for the proposed character recognition system.


3.2  Multi-stage  character  recognition 

Since there are as many as 5000 commonly used characters, plus 62 alphanumerics and symbols, in a traditional multi-linguistic Chinese document, it is desirable to perform a coarse classification (or clustering) to reduce the domain size of the character recognition. With a small working domain, not only can the overall recognition speed and recognition rate be greatly improved, but the training of the fine-grained character recognizer also becomes much easier and faster. As illustrated in Figure 4, the two-stage mixture character recognizer contains a coarse classifier as the first stage, followed by a fine-grained character recognizer.

3.2.1  Coarse  classification 

In order to achieve a balanced recognition performance in a multi-stage recognition system, the coarse classifier needs to maintain a very high accuracy (e.g., > 99.9%). Although this is a difficult task, we propose to use the CCT feature and the overlapped K-means clustering algorithm [2] to implement the coarse classifier. Training the proposed coarse classifier on the two public databases described in Section 3.3 achieved this goal. The training and testing results are listed in Table 1.
Table 1: The training and testing results of coarse classification

Number of    Ave. no. of characters    Inside test        Outside test
clusters     in a cluster              clustering rate    clustering rate
60           1156                      100%               99.9%

3.2.2  Character  recognition 

The character recognizer is built from BDNNs on a per-character basis. A character recognizer is dynamically formed by a set of character BDNNs, which correspond one-to-one to the characters in the cluster activated by the coarse classifier. The character BDNNs were first trained by the unsupervised learning scheme; then each BDNN was fine-tuned by the supervised learning scheme. During the retrieving phase, each character BDNN produces a score according to the mixture Gaussian distribution function $p(x \mid \omega_i)$. The character BDNN with the highest score is the winner, and its corresponding reference character is taken as the output of the character recognizer. By allowing about 6.7% rejection of the input characters, the recognition accuracy can reach 92.12%. Training and testing results of the character recognizer are listed in Table 2.

Table 2: Experimental results of the multi-linguistic character recognition by BDNN.

Test type    Top 1     Top 2     Top 3
inside       91.07%    95.15%    96.18%
outside      88.22%    89.76%    90.89%
w/rej        92.12%    93.01%    93.68%

3.3  Handwritten  Character  Databases 

In this research, there are two main sources of data for the training and testing of BDNN: the CCL/HCCR1 [5] database and the CEDAR database. The CCL/HCCR1 database contains more than 200 samples for each of 5401 frequently used Chinese characters. The samples were collected from 2600 people, including junior high school and college students as well as employees of ERSO/ITRI. Each sample character was scanned at 300 dpi to generate a 144 x 150 pixel image. Each character sample in the 5401-character database was manually arranged in sequential order according to its script regularity. In other words, a handwritten character that is scripted like a printed character is placed at the beginning of the sequence, and cursive handwritten character samples are placed toward the end of the sequence. The CEDAR database contains various styles of handwritten alphanumerics, which were lifted from envelope address blocks in the USA.
3.4 Overall performance evaluation

The system built upon the proposed BDNN has been demonstrated to be applicable under reasonable variations of character orientation, size, and stroke width. The system has also been shown to be very robust in recognizing characters written with various tools, such as pencils, ink pens, marker pens, and Chinese calligraphy brushes. The prototype system takes only five seconds to partition the characters on an A4-size input image, and takes 270 ms on average to identify a character image out of a commonly used Chinese character set on a Pentium 100 based personal computer. For alphanumeric character recognition, recognition is about three times faster. Furthermore, because of the inherently parallel and distributed processing nature of BDNN, the technique can be easily implemented in specialized hardware for real-time performance.


4  Concluding  Remarks 

In this paper, a neural network based handwritten character recognition system is proposed and implemented on a Pentium-90 based personal computer. The system performs coarse classification and mixture character recognition. The BDNN, a Bayesian decision-based neural network, is applied to implement the major modules of this system. This modular neural network deploys one subnet to take care of one object (character), and is therefore able to approximate the decision region of each class locally and precisely. This locality property is especially attractive for personal handwriting or signature identification applications. Moreover, because its discriminant function obeys probability constraints, BDNN has further desirable properties, such as low false acceptance and false rejection rates. On the other hand, due to the enormous number of variations involved, handwriting recognition applications still require more work before they can reach performance comparable to that of a human. Document analysis and recognition therefore remains an interesting and fascinating research topic in the field of intelligent information processing.
Figure 3: Example of receptive field overwhelming: (a) a newly created receptor Si is overwhelmed by a receptor Sj; (b) the properly initialized receptor Si and the receptor Sj. (Curve labels: output of a newly created receptor i; output of receptor j.)

Figure 4: The architecture of the two-stage mixture character recognizer.
Abstract

A dynamic neural network is developed to detect soft failures of sensors and actuators in automobile engines. The network, currently implemented off-line in software, can process multi-dimensional input data in real time. The network is trained to predict one of the variables using the others. It learns to use redundant information in the variables, such as higher-order statistics and temporal relations. The difference between the prediction and the measurement is used to distinguish a normal engine from a faulty one. Using the network, we are able to detect errors in the manifold air pressure sensor (Vs) and the exhaust gas recirculation valve (Va) with a high degree of accuracy.

1  Introduction 

The basic behavior of an automotive engine is well known (Dobner 1983, Cook and Powell 1988). In the intake manifold of an automotive engine, shown schematically in Figure 1, the mass air flow rate (Vi), exhaust gas recirculation valve position (Va), engine speed (Vo), and manifold absolute pressure (Vs) are related by first-order dynamics:

dVs/dt = F(Vi, Vo, Va, Vs).

In many automobiles, sensors directly measure the variables Vs, Vi, and Vo, and the actuator command Va is also monitored. However, the above equation indicates that there is redundancy among these variables. The consistency of the time history of the four variables can be used to check for faults in the three sensors and in the actuator. Thus, for example, by monitoring the variable Vs, we should be able to reliably detect errors in variables such as Va. We present a neural network model that can capture the above dynamics of a six-cylinder engine on a production vehicle. Even though the neural network presented here is for a specific engine diagnostic problem, the approach is quite general and can easily be used for other applications as well.

*To whom correspondence should be addressed. Phone: 818-395-2805. Fax: 818-792-7402. Email: dawei@hope.caltech.edu


Figure 1: Engine flow diagram. There are two in-flows, Vi and Va, and one out-flow, Vo. Because of the conservation of mass, the change of Vs, which is proportional to the total change of mass in the manifold, is proportional to the net mass flow, which is a function of Vi, Va, Vo, and Vs.


2  Network 

A two-layer feedback neural network is developed to predict one variable (Vs) using the three others. The architecture of the network is illustrated in Figure 2. The network has 3 feedforward inputs (Vi, Va, and Vo), 16 first (hidden) layer neurons, and 1 second (output) layer neuron.‡ The predicted Vs is fed back as the fourth input. The facts that (i) the first layer uses time-delayed output variables and (ii) the input-output relationship of each neuron is sigmoidal allow the network to capture the knowledge that the physical system is characterized by first-order nonlinear dynamics.

† There is an extensive body of literature on fault diagnosis. The inherent relationships and redundancies of measured variables of dynamic processes are often used to detect faults (e.g. Isermann 1993).

‡ Different numbers of hidden neurons have been tried; 16 gives a good level of performance for this task.
The data for training and testing the network were collected by a laptop computer during normal city and highway driving, using an experimental hardware setup within the vehicle. The data were then loaded onto a SUN workstation for training and testing. We focus on faults in the sensor Vs and the actuator Va. The latter was chosen because it is difficult to detect. The faults were introduced by a hardware fault generator that simulates an 80% Vs fault or an 80% Va fault. (An 80% Vs fault means that the sensor Vs reading is 80% of the real value. An 80% Va fault means that, in the local actuator control loop for Va, the sensor output is 80% of the actual value. This causes the actuator Va to open more for a given Va command, thereby increasing Vs by roughly 5%.)


Figure 2: Network architecture, with inputs Vi(t), Va(t), Vo(t), and Vs(t). This feedback network has two layers of neurons; there are a total of 80 connections (five for each hidden neuron) and 17 thresholds (one for each neuron). These 97 parameters of the network are trained by back error propagation (BEP).

The connections of the network are trained by back error propagation (BEP) with a momentum term on a training data set. To learn the dynamics correctly, the training data are presented in the following fashion:

1) find a random starting point in a long time sequence of data, and set the initial value of the feedback input to the measured Vs;

2) run the input through the network to get an output Vs, and calculate the output error (the squared difference between the predicted Vs and the real one);

3) set the feedforward input to the next data point and the feedback input to the predicted Vs;

4) repeat steps (2) and (3) for 100 steps to collect the error signal;

5) repeat steps (1), (2), (3), and (4) 4 times to further collect the error signal;

6) update the connections according to the BEP learning rule;

7) repeat (1) through (6) until the error no longer decreases or until the limit of computation is reached.
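Steps (1) through (4) of this presentation scheme can be sketched as a closed-loop rollout. The following Python sketch is illustrative only and omits the BEP weight update itself. It assumes that all variables are scaled to the interval (0, 1) so that a sigmoidal output neuron can represent Vs, and that the prediction made at step t is compared with the measured Vs at step t + 1; neither detail is stated in the text. The names `init_params`, `predict`, and `rollout_error` are hypothetical.

```python
import math
import random

N_IN, N_HID = 4, 16  # inputs: Vi, Va, Vo plus the fed-back Vs; 16 hidden neurons

def init_params(rng):
    """Random initial weights: 64 + 16 connections and 16 + 1 thresholds."""
    return {
        "w1": [[rng.uniform(-0.5, 0.5) for _ in range(N_IN)] for _ in range(N_HID)],
        "b1": [0.0] * N_HID,
        "w2": [rng.uniform(-0.5, 0.5) for _ in range(N_HID)],
        "b2": 0.0,
    }

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def predict(params, vi, va, vo, vs_prev):
    """One forward pass of the two-layer network with sigmoidal neurons."""
    inputs = (vi, va, vo, vs_prev)
    hidden = [sigmoid(sum(w * u for w, u in zip(row, inputs)) + b)
              for row, b in zip(params["w1"], params["b1"])]
    return sigmoid(sum(w * h for w, h in zip(params["w2"], hidden)) + params["b2"])

def rollout_error(params, data, start, steps=100):
    """Accumulate squared error over a closed-loop run of `steps` steps.

    data: list of (vi, va, vo, vs) tuples; needs at least start + steps + 1 rows.
    """
    vs_fb = data[start][3]                 # step 1: feedback starts at measured Vs
    err = 0.0
    for t in range(start, start + steps):
        vi, va, vo, _ = data[t]
        vs_pred = predict(params, vi, va, vo, vs_fb)   # step 2: network output
        err += (vs_pred - data[t + 1][3]) ** 2         # assumed one-step-ahead target
        vs_fb = vs_pred                                # step 3: feed prediction back
    return err
```

In this layout the parameter count is 16 x 4 + 16 connections plus 16 + 1 thresholds, which matches the 80 connections and 17 thresholds given in the Figure 2 caption.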

The  performance  of  the  network  is  tested  on  a  separate  validating  data  set. 
In  all  the  plots,  except  the  one  mentioned,  the  validating  data  set  is  used. 

3  Network  Performance 

The purpose of training the closed-loop (feedback) network is to make it possible to distinguish a normal vehicle from a faulty one, and thus to diagnose a fault. The variance (root mean square) of the distribution of the difference between the predicted and the measured Vs is used as the quantitative measure of the network's identification power.

3.1  System  Identification 

The trained neural network predicts the Vs value very well. This is shown in Figure 3 (left) for a segment of the normal data set, sampled every 25 ms for 400 seconds under normal city driving conditions. The predicted Vs values (dashed line) and the measured Vs values are very close to each other.


Figure  3:  Vs  prediction  for  normal  vehicle  (left)  and  faulty  vehicle  (right). 

With a network of this accuracy, it is easy to detect faults in sensor Vs. Figure 3 (right) shows a segment of the data set collected with a faulty Vs sensor (the reading is 80% of the true value). It is the same vehicle but with an altered Vs sensor. We can see that the prediction is about 20% above the measured value, i.e., the network predicts a Vs value 20% greater than the actual reading, given the other variables.

But if a fault causes only small changes in Vs, it is not so easy to see the difference in a plot like Figure 3. For the data set collected with the faulty Va actuator (the Va actuator opens 20% more than the Va command), the change in Vs is only about 5%. The difference between the predicted and measured Vs, i.e., the residual, gives a quantitative measure. The distribution of the residual is quite different for a normal vehicle and a faulty one.

3.2  Diagnostic  Variable 

Figure 4 (left) shows the Vs residual values for the same segment of the data set as in Figure 3 (left) for a normal vehicle. The residuals are well within 1, with the mean close to 0. There are many ways to characterize the residual, e.g., binning it for different Vs values and/or Va values. Even the residuals at different times could give information on whether a vehicle is normal or not.

Figure 4: Vs residual versus time (sec) for a normal vehicle (left) and a vehicle with a Va fault (right).

The residual variance is the most natural quantity for characterizing the spread of the residual distribution. For the current application, it is sufficient to separate a normal vehicle from a faulty one. For the segment of the normal data set shown in Figure 3 (left), the running average of the residual variance is shown in Figure 5 (left, the curve in the middle). The residual variance converges to 0.8 within about 200 sec. This translates to roughly 1.8% of the measured Vs. We can see from Figure 5 (right) that the Vs residual variance of a vehicle with an 80% Va fault is much larger than this, so it is possible to detect the Va fault with the average Vs residual variance.

3.3  Discrimination  Power 

Since the Vs fault is easy to detect, only the performance of the network in detecting the 80% Va fault is presented in the following. Figure 4 (right) shows the Vs residual values for a vehicle with an 80% Va fault. The Vs residuals have much larger variances, in contrast to those in Figure 4 (left) for a normal vehicle.


Figure 5: Vs residual variance for a normal vehicle (left) and a vehicle with a Va fault (right). Three segments are plotted for each of the normal data set and the faulty data set. The variances for a faulty vehicle are larger than the normal ones by more than a factor of two.


Figure 5 shows the running average of the Vs residual variances for three segments of the normal data set (left) and three segments of the Va fault data set (right). The Vs residual variances for the faulty data set are around 1.8, more than two times larger than the 0.8 variance for the normal data set. Again, the variances approach their asymptotic values within about 200 sec. The small difference from segment to segment reflects the random driving pattern during the data collection (there was no set driving schedule).
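The diagnostic described here can be sketched as a running root-mean-square of the residual followed by a threshold test. This is a minimal illustration, not the authors' procedure: it follows the paper's usage in which "variance" denotes the root mean square of the residual, and the decision threshold of 1.3 is a hypothetical value chosen only because it lies between the reported normal (about 0.8) and faulty (about 1.8) levels. The names `running_rms` and `is_faulty` are illustrative.

```python
import math

def running_rms(residuals):
    """Running root-mean-square of the residuals seen so far."""
    out, acc = [], 0.0
    for n, r in enumerate(residuals, start=1):
        acc += r * r
        out.append(math.sqrt(acc / n))
    return out

def is_faulty(residuals, threshold=1.3):
    """Flag a fault when the final running RMS exceeds the (assumed) threshold."""
    return running_rms(residuals)[-1] > threshold
```

In practice the residuals would be the differences between predicted and measured Vs, and the running value would be read only after it has settled, which the text reports takes about 200 sec.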

4  Network  Generalization 

The most serious concern for any data-dependent model (neural network and math-based models alike) is how well the model generalizes. This is investigated in two ways: variations from the training data to the validating data, and variations across different drivers.

4.1  Training  and  Validating 

Figure 6 (left) shows the Vs residual variances on three segments of the training data set. These can be compared with the three segments of the validating data set in Figure 5 (left). Both the training and validating data were collected for the same driver A in this case. All the running averages of the Vs residual variances approach 0.7 to 0.9 after 200 seconds. There is no significant difference in the variance between training and validating data.

Figure 6: Vs residual variance versus time (sec) for training data (left) and for different drivers (right). On the right, the lower two curves are for driver B and the upper one is for driver C.

4.2  Different  Drivers 

Figure 6 (right) shows the Vs residual variances of three segments of normal data from two other drivers (B and C), which can be compared with the three segments of the validating data set as before (Figure 5, left). The network was trained for driver A, so the data for the other drivers are not part of the training data set. The running averages of the Vs residual variances for drivers B and C are only slightly higher, ranging from 0.7 to 1.1.

The small differences in performance among the training data, the validating data, and different drivers are well below the level of Vs change caused by the Va fault. Thus the network can still give a reliable signal to detect the fault.


5  Discussion 

Based on the intrinsic dynamics of the intake manifold of an automotive engine (shown in Figure 1), we chose a feedback network instead of a feedforward one. To test what a feedforward network can do, we also trained a network without the Vs feedback, i.e., a standard two-layer feedforward network.

Figure 7 shows the performance of the trained feedforward network. The running averages of the Vs residual variances for three segments of the normal data set (left) and three segments of the Va fault data set (right) are shown in this figure. The prediction accuracy of the feedforward network is much lower than that of the feedback network (Figure 5). The variances for the normal data set are two times larger than those of the feedback network and are very close to the variances for the Va fault data set.

Figure 7: Performance of a feedforward network. As in Figure 5, the Vs residual variance versus time (sec) is shown for a normal vehicle on the left and for a vehicle with a Va fault on the right. Three segments are plotted for each of the normal data set and the faulty data set. Unlike in Figure 5, the variances for normal and faulty vehicles are not very far apart.


Another alternative is to train a feedforward network with the true Vs as input, rather than the feedback from the output. With the same training schedule, the network learned mostly to follow the Vs input. Thus the performance is even worse than without the Vs input, in terms of discriminating faulty from normal vehicles.

We have also trained feedforward networks with multiple time-delayed inputs of Vi, Va, and Vo. They have a similar level of performance to the feedforward network in Figure 7, which is much worse than that of the feedback one. Thus the feedback element of the network is truly important.
6 Summary

This research demonstrates the usefulness of applying neural network technology to engine modeling and diagnostics. Using a well-accepted statistical measure, the variance, the two-layer network with feedback achieved an accurate manifold air pressure (Vs) prediction with 1.8% variance, which enables the detection of the 4.5% Vs variance caused by exhaust gas recirculation valve (Va) faults of the same vehicle.

We should point out that it is not necessary to collect the data in continuous 400 sec windows. For the current method to work, one only needs to collect small pieces of data, say 2 or 3 seconds long, and collect many pieces to accumulate enough statistics. On the other hand, collecting and processing data continuously every 25 ms is itself not very demanding. The computational need for processing data after the network has been trained is only about 4000 multiplications per second.
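As a rough consistency check on this figure, the multiplication rate can be estimated from the network size and the sampling period. The sketch below assumes one multiplication per connection per time step, which is an assumption made here and not an itemized count from the text; the function name is hypothetical.

```python
def multiplications_per_second(connections, sample_period_s):
    """Estimated multiply rate, assuming one multiplication per connection per step."""
    return round(connections / sample_period_s)

# 80 connections evaluated once every 25 ms
estimate = multiplications_per_second(80, 0.025)
```

Under this assumption the estimate is 3200 multiplications per second, the same order of magnitude as the roughly 4000 quoted; any remainder would have to come from operations not modeled here, such as evaluating the sigmoids.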
Abstract - We introduce a new generalized feed-forward structure that provides for multiple time scales. The gamma, Laguerre, and other locally recurrent feed-forward structures perform poorly in cases where widely varying time constants are required. By exponentially varying the time constant along the delay line, a single delay line is able to represent signals that include various time scales. We demonstrate both discrete- and continuous-time versions of this multiple time-scale structure, which we call the multi-scale gamma filter. The multi-scale gamma filter has a very natural implementation in sub-threshold CMOS, and measured impulse responses from a continuous-time analog VLSI chip are shown.


1  INTRODUCTION 

For  many  practical  problems,  the  gamma  structure  is  superior  to  the  standard 
tap  delay  line  because  of  its  ability  to  automatically  choose  an  appropriate 
time-scale  [1]  [2]  [3].  This  advantage  becomes  particularly  significant  for 
problems  involving  extremely  long  impulse  responses  for  which  the  standard 
tap  delay  line  solutions  can  require  thousands  of  taps.  Unfortunately,  the 
gamma  structure  has  a  few  problems  of  its  own: 

1. Choosing the optimal time scale is a very difficult computational problem.
Gradient descent is not guaranteed to find the optimal time
scale [4]. This problem becomes particularly troublesome when we build
dedicated hardware (analog or digital) for implementing these filters.

2. Even when a single optimal time-scale can be found, the structure may
not be able to efficiently represent information occurring at other time
scales.

3. As with the FIR case, choosing an appropriate number of taps is also a
difficult optimization procedure.

Sections 2 and 3 discuss the discrete- and continuous-time realizations of
a multiple time-scale generalization of the gamma filter which addresses the
above problems. Section 4 discusses an analog VLSI hardware implementation
of this filter.


2  DISCRETE-TIME  MULTI-SCALE  GAMMA  FILTER 


Our  proposed  discrete-time  multi-scale  gamma  filter  is  shown  in  Figure  1. 
Unlike  the  gamma  filter,  the  location  of  the  pole  at  each  stage  depends  on 
the  tap  k.  The  transfer  function  between  taps  can  be  written  as: 


$$ \frac{X_k(z)}{X_{k-1}(z)} = \frac{a^k \mu}{z - (1 - a^k \mu)} \qquad (1) $$


(We assume that a < 1.) If we perform a two-dimensional search for the optimal
a and μ, this structure cannot perform worse than the original gamma,
since the gamma filter is included as a special case when a = 1. We can


Figure 1: Discrete-time multi-scale gamma filter (stage feedback gains labeled 1 - μ, 1 - aμ, ..., 1 - a^k μ)

demonstrate  a  problem  for  which  the  multi-scale  filter  easily  outperforms  the 
gamma filter by posing a system ID problem that includes two widely separated
poles. We use the transfer function H(z):


$$ H(z) = \frac{0.3}{(z - 0.05)(z - 0.95)} \qquad (2) $$
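
The tap recursion and the test system can be sketched in a few lines of Python. This is a minimal sketch rather than the simulation code used for the figures: the function names, the stage indexing from k = 0 with stage gain a^k μ (chosen so that μ = 1 turns the first stage into an ideal delay), and the signal lengths are assumptions made for illustration.

```python
def multiscale_gamma_taps(x, num_stages, mu=1.0, a=0.9):
    """Tap signals of a discrete-time multi-scale gamma delay line.

    Assumed indexing: stage k = 0, 1, ... has gain c_k = a**k * mu and obeys
        x_k[n+1] = (1 - c_k) * x_k[n] + c_k * x_{k-1}[n],
    where the input plays the role of x_{-1}. With a = 1 every stage has the
    same gain, which is the ordinary gamma delay line.
    """
    taps = [[0.0] * len(x) for _ in range(num_stages)]
    for k in range(num_stages):
        c = (a ** k) * mu
        source = x if k == 0 else taps[k - 1]
        for n in range(len(x) - 1):
            taps[k][n + 1] = (1.0 - c) * taps[k][n] + c * source[n]
    return taps


def target_system(x):
    """Output of H(z) = 0.3 / ((z - 0.05)(z - 0.95)) driven by x.

    Expanding the denominator to z**2 - z + 0.0475 gives
        y[n+2] = y[n+1] - 0.0475 * y[n] + 0.3 * x[n].
    """
    y = [0.0] * len(x)
    for n in range(len(x) - 2):
        y[n + 2] = y[n + 1] - 0.0475 * y[n] + 0.3 * x[n]
    return y
```

The tap signals returned by `multiscale_gamma_taps` would then feed a linear combiner whose weights are obtained from the Wiener-Hopf equations; that step is not shown here.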


For the gamma structure, we scan all possible values of μ between 0 and 1.
For convenience, we set μ = 1 in the multi-scale gamma so that the first
stage is exactly an ideal delay. We then scan all values of a between 0
and 1. The Wiener-Hopf equations were used to solve for the optimal weight
values in order to obtain the MSE. The results for various numbers of taps
(from  2  to  5)  are  shown  in  Figure  2.  In  all  cases,  the  gamma  filter  has  more 
difficulty  in  approximating  the  system  with  a  small  number  of  taps.  Since 
we  are  scanning  all  values  of  the  free  parameter,  a  practical  optimization 
procedure  is  still  an  open  problem.  The  solution  for  the  multi-scale  gamma 
can be further improved if we also optimize for μ instead of setting μ = 1.
Currently  we  are  not  exploring  this  direction  because  we  seek  simple  search 
methods  that  can  be  readily  implemented  in  dedicated  hardware.  The  multi- 


Figure 2: Mean square error comparisons between the gamma filter (solid line) and
the multi-scale gamma filter (dotted line). We plot MSE vs. μ (for gamma) or a
(for multi-scale gamma).

scale  gamma  structure  is  now  better  able  to  represent  signals  with  widely 
varying  time  constants  but  we  still  require  a  difficult  optimization  procedure 
to find the optimal a (assuming μ = 1). Rather than perform this difficult
search  procedure,  we  borrow  a  standard  weight  pruning  technique  from  neural 
network  theory  [5].  We  purposely  include  more  taps  than  we  need  in  order  to 
cover  a  large  range  of  time  scales  and  selectively  deactivate  any  weight  values 
that  are  small  in  magnitude. 

We  have  simulated  a  system  ID  problem  using  the  multi-scale  gamma  filter 
in  which  the  transfer  function  of  the  unknown  system  is  given  by: 

$$ H(z) = \frac{(\ldots - 1)}{z (z - 0.1)(z - 0.7)(z - 0.9)} \qquad (3) $$

The  Mean  Square  Error  vs.  number  of  taps  is  shown  in  Figure  3.  For  an 
n-tap filter, the smallest 10 - n weight magnitudes are set to zero. We again
assume that μ = 1. There are a few things to note in Figure 3. First,
the a = 0.9 solution is better than a = 0.5 for an equal number of taps.
For a = 0.5, the poles are spaced too far apart. Second, for both values
of  a  there  is  a  sharp  transition  beyond  which  adding  more  taps  does  not 
decrease  the  error.  This  sharp  transition  can  be  used  to  choose  a  reasonable 




Figure 3: Plot of MSE vs. number of taps using the weight pruning method. For
each  number  of  taps,  the  smallest  magnitude  weights  are  set  to  zero. 

number  of  taps  for  each  problem.  Minimizing  the  number  of  taps  reduces 
the overall amount of computation and also lowers the misadjustment of the
system.  We  plan  to  use  this  weight  pruning  method  to  avoid  implementing 
complex,  non-convex  optimization  procedures  in  our  dedicated  hardware.  The 
weight  pruning  method  also  provides  a  mechanism  for  choosing  an  appropriate 
number  of  taps. 
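
The pruning step itself is simple enough to sketch in Python. The helper below is hypothetical (its name and list-based interface are not from the paper): it keeps the largest-magnitude weights and zeroes the rest, as in the 10-tap experiment where the smallest 10 - n weights are removed. Whether the surviving weights are re-solved after pruning is not specified in the text, so the sketch leaves them unchanged.

```python
def prune_weights(weights, num_keep):
    """Keep the num_keep largest-magnitude weights and set the others to zero."""
    # Indices sorted from largest to smallest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(order[:num_keep])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]
```

For example, `prune_weights([0.5, -0.01, 2.0, 0.03, -1.2], 3)` keeps the three weights 0.5, 2.0 and -1.2 and zeroes the two small ones.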


3  CONTINUOUS-TIME  MULTI-SCALE  GAMMA  FILTER 

Our  continuous-time  multi-scale  gamma  (ms-gamma)  structure  is  shown  in 
Figure  4.  The  ms-gamma  is  a  cascade  of  first-order  low-pass  filters  with 
time  constants  that  slow  down  exponentially  as  signals  propagate  down  the 
cascade. If we define the time constant of the last stage to be τ, then the
next to the last stage has a time constant of aτ where 0 < a < 1. Since


Figure 4: Continuous-time multi-scale gamma memory (tap outputs labeled G_k(s), ..., G_1(s), G_0(s))

the  time  constant  changes  by  a  factor  of  a  for  each  stage,  if  we  set  a  = 
1,  the  ms-gamma  reduces  to  the  usual  gamma  memory.  We  can  simplify 
the  mathematical  analysis  by  considering  an  infinite  cascade  of  sections.  In 
general,  for  a  <  1 


$$ H_k(s) = \frac{1}{a^k \tau s + 1} \qquad (4) $$

The full system response at tap i is given by

$$ G_i(s) = \prod_{k=i}^{\infty} \frac{1}{a^k \tau s + 1} \qquad (5) $$

Because of the infinite cascade, the following scaling laws can be easily derived.
In the s-domain:

$$ G_{i+1}(s) = G_i(as) \qquad (6) $$

In  the  time-domain: 

$$ g_{i+1}(t) = \frac{1}{a}\, g_i\!\left(\frac{t}{a}\right) \qquad (7) $$

Therefore the impulse responses g_i(t) are all identical with the exception of a
scaling of the amplitude and time axis.

We  have  derived  an  analytic  form  of  the  impulse  response  at  each  tap  for  the 
ms-gamma  for  both  the  finite  and  infinite  cascade  versions.  Both  expressions 
consist of a weighted sum of exponentials. The expression for the impulse
response  of  the  infinite  cascade  is  simpler  and  can  be  written  as: 


where 


Figure 5: (a) Simulated multi-scale gamma kernels, (b) Measured kernels.

Figure 5 shows the impulse response curves of ms-gamma both simulated
and measured. The peak values of the impulse response are equally spaced
for the gamma filter but are not equally spaced for the ms-gamma. For the
infinite cascade, define t_i to be the time of the peak of the impulse response g_i(t);
the scaling law results in the following relation:

$$ t_{i+1} = a\, t_i \qquad (10) $$

which means that the time of the peak at stage i + 1 is simply the product
of a and the time of the peak at stage i. This implies that the peaks of
the impulse responses for consecutive taps of the infinite cascade are equally
spaced on a log time plot. Figure 6 shows the peak location of 10 normalized
tap  responses  on  a  log  time  plot  for  the  finite  ms-gamma  memory.  Notice 
that  after  the  first  few  taps,  the  peak  values  become  equally  spaced  and  the 
impulse  response  shapes  converge  to  the  same  shape.  This  is  exactly  what  is 
expected  from  the  infinite  cascade  analysis. 
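
The log-time spacing of the peaks can be checked numerically. The Python sketch below is illustrative and is not the code behind Figure 6: it builds the tap impulse response of a finite cascade as a weighted sum of exponentials (a partial-fraction expansion of the product of first-order sections) and locates its peak on a log-spaced grid. The stage indexing (stage k has time constant a^k τ, and tap i sees stages i and above), the default a = 0.5, and the grid search are assumptions.

```python
import math


def tap_impulse_response(i, num_stages, tau=1.0, a=0.5):
    """Return g_i(t) for a finite ms-gamma cascade as a sum of exponentials.

    Assumed indexing: stage k has pole p_k = 1 / (a**k * tau), and tap i is
    the output after stages num_stages - 1, ..., i (fastest stage first).
    """
    poles = [1.0 / (a ** k * tau) for k in range(i, num_stages)]
    coeffs = []
    for k, pk in enumerate(poles):
        # Residue of prod_j p_j / (s + p_j) at s = -p_k (poles are distinct).
        c = pk
        for j, pj in enumerate(poles):
            if j != k:
                c *= pj / (pj - pk)
        coeffs.append(c)
    return lambda t: sum(c * math.exp(-p * t) for c, p in zip(coeffs, poles))


def peak_time(g, t_min=1e-4, t_max=1e2, points=20000):
    """Locate the maximum of g on a log-spaced time grid."""
    ratio = (t_max / t_min) ** (1.0 / (points - 1))
    t = t_min
    best_t, best_v = t, g(t)
    for _ in range(points - 1):
        t *= ratio
        v = g(t)
        if v > best_v:
            best_t, best_v = t, v
    return best_t
```

With a = 0.5 and 12 stages, the ratio of the peak times of taps 1 and 0 should come out close to a, as the scaling law predicts for taps far from the input. For a close to 1 the poles nearly coincide and this expansion becomes numerically ill-conditioned, so a direct simulation would be the safer choice there.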


Figure  6:  Impulse  responses  of  the  multi-scale  gamma  memory  on  a  log-time  plot. 
The  peak  value  of  each  impulse  response  has  been  normalized  to  unity  for  display 
purposes. 

We have also performed system identification simulations using the ms-gamma
filter in continuous time. Figure 7 shows that the ms-gamma performance
index is fairly flat when 1/τ is larger than 800. Since the performance surface
is fairly constant and close to its optimal value in this region, finding the optimal
τ may not be so important for this structure. Figure 8 shows that when
the number of taps (k) increases, the performance index becomes even more
flat. With our example, the unknown system is a 5th-order system and using
a 5th-order ms-gamma filter is enough to approximate the unknown system.
Increasing the number of taps to k = 6 does not provide much improvement.
These results suggest that there may be no need to find an optimal scale (as
is necessary for the gamma) if many time scales are explored simultaneously
(as in the ms-gamma). The misadjustment of the system may be reduced by
systematically zeroing out any weights that do not contribute significantly to
the output.

Figure 7: Performance index comparison of gamma and multi-scale gamma structures
(dashed line: multiple time-scale filter; tap number k = 5; horizontal axis: inverse time constant).


Figure 8: Performance index of multi-scale gamma with different numbers of taps
(k). 

4  ANALOG  HARDWARE  IMPLEMENTATION 

We have previously implemented the gamma filter in analog VLSI [6] and
have integrated a continuous-time LMS gradient descent method to determine
the weight values [7]. In order to implement the multi-scale filter, a resistive
line is connected along the tap bias controls to achieve a linear voltage drop
from one end to the other. Because the CMOS transamps are operated in the
sub-threshold region, the output currents are exponential in the bias voltages,
which means their poles also decrease exponentially as k increases. A similar
strategy was used in the implementation of the silicon cochlea [8]. Figure 9
shows the schematic for the k-tap multi-scale gamma structure. The resistors
are chosen with the same value; the time constant τ is controlled by the
voltage V_high and the factor a is set by V_low. A twelve-tap multi-scale gamma
structure has been fabricated using MOSIS 2 μm n-well technology. The
measured impulse responses from the chip are shown in Figure 5(b). This
analog implementation provides a fast, low-cost, low-power solution for many
adaptive filtering applications.

Figure 9: Circuit implementation of the multi-scale gamma structure.
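
The way a linear voltage drop produces exponentially spaced poles can be illustrated with a short Python sketch. It assumes the textbook sub-threshold law I = I0 exp(κV / U_T) and a pole frequency proportional to the bias current; the parameter values (I0, κ, U_T, the voltages) and the function names are illustrative assumptions, not measurements from the chip.

```python
import math


def tap_bias_currents(num_taps, v_high, v_low, i0=1e-15, kappa=0.7, u_t=0.0258):
    """Bias currents of taps fed from a uniform resistive divider.

    Equal resistors give equal voltage steps between v_high and v_low, and
    the assumed sub-threshold law turns them into a geometric current series.
    """
    step = (v_high - v_low) / (num_taps - 1)
    voltages = [v_high - k * step for k in range(num_taps)]
    return [i0 * math.exp(kappa * v / u_t) for v in voltages]


def scale_factor(num_taps, v_high, v_low, kappa=0.7, u_t=0.0258):
    """Ratio a between consecutive bias currents (and hence pole frequencies)."""
    step = (v_high - v_low) / (num_taps - 1)
    return math.exp(-kappa * step / u_t)
```

Under these assumptions, V_high fixes the largest bias current and, with V_high held fixed, V_low fixes the ratio a between neighbouring taps.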


5  CONCLUSION 

We  have  introduced  the  multi-scale  gamma  to  allow  for  widely  varying  time- 
scales  in  the  input  signals.  The  gamma,  Laguerre  and  other  locally  recurrent 
feedforward  structures  perform  poorly  in  cases  where  widely  varying  time 
constants  are  required.  By  exponentially  varying  the  time  constant  along  the 
delay  line,  a  single  delay  line  is  able  to  represent  signals  that  include  various 
time scales. Our results also suggest that we may be able to skip the difficult
search procedures required to find a single optimal time constant as is necessary
for the standard gamma and Laguerre filters. We demonstrate both
discrete- and continuous-time versions of the multi-scale gamma structure.
The same extension can be applied to the Laguerre memory and most other
locally recurrent networks. These multiple time-scale structures have a natural
implementation in sub-threshold CMOS and measured impulse responses
from a continuous-time multi-scale gamma chip were shown.

A probabilistic neural network based technique is presented for
unsupervised quantification and segmentation of the brain tissues
from magnetic resonance images. The problem is formulated as
distribution learning and relaxation labeling that may be particularly
useful in quantifying and segmenting abnormal brain tissues
where the distribution of each tissue type heavily overlaps.
The new technique utilizes suitable statistical models for both
the pixel and context images. The quantification is achieved by
model-histogram fitting of probabilistic self-organizing mixtures
and the segmentation by global consistency labeling through a
probabilistic constraint relaxation network. Experimental results
show the efficient and robust performance of the new algorithm.


1.  INTRODUCTION 

Quantitative analysis of brain tissues refers to the problem of estimating
tissue quantities from a given image and segmentation of the image
into contiguous regions of interest to describe the anatomical structures.
The problem has recently received much attention largely due to the improved
fidelity and resolution of medical imaging systems, and because of
its ability to deliver high resolution and contrast, magnetic resonance (MR)
imaging has been the dominant modality for research on this problem. For
quantification of brain tissues from MR images, the stochastic model based
approach has been by far the most popular. The stochastic model based
approach typically employs a finite mixture model, which we have shown
in our recent study of MR image statistics to be a very suitable model for the
task. Therefore, probabilistic neural networks are particularly suitable for
application in quantitative analysis of MR images, since while providing a
formal statistical formulation of the problem they also offer efficient on-line
computation of the quantities of interest, a feature especially important
for evaluation of studies in a clinical setting, for example an analysis to be
performed on a sequence of MR images.

In this paper, we present a probabilistic neural network approach for
efficient analysis of brain tissues by using single-valued MR brain scans.
The procedure provides a complete treatment of the problem of quantitative
image analysis in that a quantification stage that includes automatic
determination of tissue types is followed by a segmentation stage which incorporates
local spatial context with global statistical description of pixel
intensities for reliable description of the anatomical structures. In particular,
we formulate tissue quantification as a distribution learning problem
and use relative entropy as the information distance between the standard
finite normal mixture (SFNM) distribution and the image histogram. The
actual quantification is performed by a probabilistic self-organizing mixtures
(PSOM) network in which we use an information theoretic criterion,
the minimum conditional bias/variance (MCBV) criterion, to determine
the suitable number of mixture components in a given MR image. The
procedure is fully unsupervised so it is able to work with both normal and
abnormal cases. The actual segmentation is performed, after quantification,
through combining maximum likelihood thresholding and stochastic
regularization, by a probabilistic constraint relaxation network (PCRN).
Experimental results demonstrate the efficient and reliable performance of
the proposed scheme, in terms of the quantification achieved by PSOM,
consistency of the order determination by using the proposed information-theoretic
criterion, MCBV, and the final segmentation results by PCRN.


2.  PROBLEM  STATEMENT 

Over the last few years, considerable success has been reported in MR
image analysis both by using finite mixture distributions and by neural
network based methods. Very recently, as a cross fertilization of these
two approaches, probabilistic neural networks have emerged as a powerful
tool in MR image analysis tasks such as tissue quantification and segmentation.
This new approach provides valuable insight for designing and learning in neural
networks, such as consistency of parameter estimates and determination of
suitable network structure, among others.

Assume that the spatial location of each pixel x_i has one-to-one correspondence
to its true label l_i. By randomly reordering all pixels in the
underlying probability space, i.e., ignoring information regarding the spatial
ordering of pixels, we can treat pixel labels as random variables and
introduce a probability measure by using a multinomial distribution with
unknown parameters π_k for each component. Since it reflects the distribution
of the number of pixels in each component, π_k can be interpreted as a
prior probability of the global context information. Thus, the relevant (sufficient)
statistics are the tone statistics for each component and the number
of pixels in each component. The marginal probability measure for
any pixel image, i.e., the SFNM distribution, can be obtained by writing
the joint probability density of x_i and l_i and then summing the joint density
over all possible outcomes of l_i, resulting in a sum of the following general
form:

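$$ f(u \mid r) = \sum_{k=1}^{K} \pi_k\, g(u \mid \mu_k, \sigma_k^2) \qquad (1) $$

with $\sum_{k=1}^{K} \pi_k = 1$ and

$$ g(u \mid \mu_k, \sigma_k^2) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\!\left(-\frac{(u - \mu_k)^2}{2\sigma_k^2}\right) $$

where μ_k and σ_k² are the mean and variance of the kth Gaussian kernel. We
use K to denote the number of Gaussian components and r to
denote the total parameter vector that includes π_k, μ_k, and σ_k² for all K
components.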
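
As a concrete illustration of the finite normal mixture, the following Python sketch evaluates the density for given mixing weights, means and variances. The function names and the list-based parameterization are assumptions made here for illustration.

```python
import math


def gaussian_kernel(u, mean, var):
    """Normal density with the given mean and variance, evaluated at u."""
    return math.exp(-(u - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def sfnm_density(u, weights, means, variances):
    """Finite normal mixture: sum_k weights[k] * g(u | means[k], variances[k]).

    The weights are assumed to be non-negative and to sum to one.
    """
    return sum(w * gaussian_kernel(u, m, v)
               for w, m, v in zip(weights, means, variances))
```

Evaluating `sfnm_density` on the gray levels of an image gives the model curve that is fitted to the image histogram during quantification.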

On the other hand, since in tissue segmentation, context information is
of particular importance, by assuming that the context images are random
variables with Markovian property, a localized SFNM distribution can be
formulated to incorporate local regularities statistically, i.e., to impose local
consistency constraints on context images in terms of a stochastic regularization
scheme. For each pixel i, we define the spatial constraint as a local
set of all pairs (l_i, l_j) such that the consistency between l_i and l_j can be
measured by the compatibility function I(l_i, l_j). We define I(·,·) as the
indicator function and define the neighborhood of pixel i, ∂i, by opening a
b × b window with pixel i being the central pixel, where b is assumed to be an
odd integer. Note that pairs of labels are either compatible or incompatible
in this case. Then, we compute the frequency of neighbors of pixel i with
labels compatible to an assumed label of pixel i, denoted by π_k^{(i)}, given the
labels of its neighbors l_{∂i}, by

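$$ \pi_k^{(i)} = p(l_i = k \mid l_{\partial i}) = \frac{1}{b^2 - 1} \sum_{j \in \partial i,\, j \neq i} I(k, l_j) \qquad (2) $$

and the localized SFNM distribution for x_i directly follows by

$$ f(x_i \mid l_{\partial i}, r) = \sum_{k=1}^{K} \pi_k^{(i)}\, g(x_i \mid \mu_k, \sigma_k^2) \qquad (3) $$

where π_k^{(i)} is interpreted as the conditional prior of component k, determined
by the uncertainty introduced by l_{∂i}.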
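
The neighbor-frequency computation can be sketched as follows. This is a minimal Python illustration, assuming the label image is stored as a list of rows of integer labels; the function name and the handling of image borders (neighbors outside the image are simply skipped) are assumptions that the text does not specify.

```python
def local_prior(labels, row, col, k, b=3):
    """Fraction of the neighbors of pixel (row, col) whose label equals k.

    The neighborhood is a b x b window (b odd) centered on the pixel, with
    the center itself excluded. Window positions that fall outside the image
    are skipped, which is an assumption about border handling.
    """
    half = b // 2
    count = 0
    total = 0
    for r in range(row - half, row + half + 1):
        for c in range(col - half, col + half + 1):
            if (r, c) == (row, col):
                continue
            if 0 <= r < len(labels) and 0 <= c < len(labels[0]):
                total += 1
                if labels[r][c] == k:
                    count += 1
    return count / total if total else 0.0
```

For an interior pixel the denominator is b² - 1, so the values over all labels k sum to one and can serve as the conditional prior in the localized mixture.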

Tissue quantification addresses the combined estimation of regional component
parameters (π_k, μ_k, σ_k²) and the detection of the structural parameter
K in Eq. (1) given the pixel images x. A distance minimization
approach is developed where the mixture density is fitted to the histogram
of the data by finding the optimal parameters with respect to a distance
measure. We use relative entropy (the Kullback-Leibler distance) [?] for
tissue quantification in MR images. Relative entropy measures the information
theoretic distance between the “true” distribution f_x(u) and the
estimated SFNM distribution f(u | r), and is given by

$$ D(f_x \,\|\, f_r) = \sum_{u} f_x(u) \log \frac{f_x(u)}{f(u \mid r)} \qquad (4) $$

Note  that  the  use  of  the  relative  entropy  cost  also  overcomes  problems 
such  as  convergence  at  the  wrong  extreme  faced  by  the  squared  error  cost 
function  as  it  weighs  errors  more  heavily  when  probabilities  are  near  zero 
and  one,  and  diverges  in  the  case  of  convergence  at  the  wrong  extreme 
[4].  We  have  shown  that,  when  relative  entropy  is  used  as  the  distance 
measure,  distance  minimization  is  equivalent  to  maximum  likelihood  (ML) 
estimation  of  the  SFNM  parameters. 
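
For a normalized histogram and a model evaluated on the same gray-level bins, the relative entropy can be computed as in the Python sketch below. The function name and the convention that empty histogram bins contribute nothing are assumptions of this illustration.

```python
import math


def relative_entropy(p, q):
    """Kullback-Leibler distance D(p || q) = sum_u p(u) * log(p(u) / q(u)).

    p and q are discrete distributions over the same bins. Bins with
    p(u) = 0 contribute zero; q(u) must be positive wherever p(u) is.
    """
    total = 0.0
    for pu, qu in zip(p, q):
        if pu > 0.0:
            total += pu * math.log(pu / qu)
    return total
```

Here `p` would be the image histogram and `q` the mixture density sampled at the bin centers and normalized; the distance is zero only when the two agree on every bin.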

Anatomical structure, in addition to the results of tissue quantification
that reveal different tissue properties, provides very valuable information
in medical applications. Tissue segmentation is a technique for
partitioning the image into meaningful regions corresponding to the objects.
Tissue segmentation may be considered as a clustering process where
the pixels are classified into the attributed tissue types according to their
gray-level values and spatial correlation. A reasonable assumption is that
spatially close pixels are likely to belong to the same tissue type. Accordingly,
tissue segmentation addresses the realization of context images l_i,
i = 1, ..., N, given the observed pixel images x. Based on the localized
SFNM formulation (3), a deterministic relaxation labeling can be used to
update the context images after global tissue quantification by locally minimizing
the pixel classification error. With a motivation similar to the one
in [2, 3], the general technique seeks a consistent labeling solution where
the criterion is to maximize a global consistency measure by using a system
of inequalities. The structure of relaxation labeling is motivated by two basic
considerations: 1) decomposition of a global computation scheme into
a network performing simple local computations; 2) suitable use of local
context regularities in resolving ambiguities.


3.  METHOD  AND  ALGORITHMS 

In this paper, we present the theory and algorithms for the two stages:
(1) quantification, which involves network order selection and adaptive computation
of the parameters to achieve soft classification, and (2) segmentation,
which uses the order and the parameters computed in the quantification
stage to perform hard classification by incorporating local context
constraints.

Since the prior knowledge on the true structure of a real image is generally
unknown, it is most often desirable to have a neural network structure
that is adaptive, in the sense that the number of local components is not
fixed beforehand. In the probabilistic neural network scheme we propose for
MR image analysis, using a smaller or larger number of mixture components
than the number of tissue types represented on a particular slice will result
in incorrect identification and quantification of the tissues in that particular
slice. This situation is particularly critical in real clinical application
where the structure of the individual slice for a particular patient may be
arbitrarily complex. We propose a new information theoretic criterion formulation,
the MCBV criterion, to solve the model selection problem. Our
approach has a simple optimal appeal in that it selects a minimum conditional
bias and variance model, i.e., if two models are about equally likely,
MCBV selects the one whose parameters can be estimated with the smallest
variance. A practical MCBV formulation with code-length expression
is further given by


$$ \mathrm{MCBV}(K) = -\log\big(L(x \mid \hat{r}_{ML})\big) + \sum_{k=1}^{K_a} \frac{1}{2} \log 2\pi e\, \mathrm{Var}(\hat{r}_{k,ML}) \qquad (5) $$

where L(·) denotes the joint likelihood function and \hat{r}_{k,ML} is the maximum
likelihood estimate.

Recently, on-line versions of the EM algorithm have been proposed for large
scale sequential learning in maximum likelihood estimation. Such a procedure
obviates the need to store all the incoming observations and changes
the parameters immediately after each data point, allowing for high data
rates. The PSOM we present here is a fully unsupervised and incremental
stochastic learning algorithm. The scheme provides winner-takes-in probability
(Bayesian “soft”) splits of the data, hence allowing the data to contribute
simultaneously to multiple tissues. By adopting a stochastic gradient
descent scheme for minimizing D(f_x || f_r), the corresponding on-line
formulation is obtained by


af fc  =  1, ir. 


,(«+!)  _  -W 


(6) 

(7) 

(8) 


where the variance factors are incorporated into the learning rates while
the posterior Bayesian probabilities are kept, and a(t) and b(t) are introduced
as the learning rates, two sequences converging to zero, ensuring
unbiased estimates after convergence. Self organization at both the neuron
and modular levels refers to a specific human brain capability, which
tends to convert the similarity of input features into the proximity of finite
participating neurons [5, 10]. Mapping this operation to the PSOM, we design
a network where both the structure and weights are updated according
to an unsupervised learning algorithm. More precisely, the network organizes
itself to efficiently map the data to the feature space through adaptive
mechanisms where the information theoretic criteria are shown to provide
a reasonable approach for the solution of the problem. In particular, both
structure and weights of the PSOM “compete” for the assignment order of
each model and assignment probability of each observation. Overall convergence
dynamics of the PSOM are similar to SOM in that a solution is
obtained by “resonating” between input data and an internal representation.
Such a mechanism can be considered as a more realistic learning than
the batch EM procedure.
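
To make the idea of an incremental, soft-assignment update concrete, the Python sketch below shows one generic on-line mixture update of this kind. It is an assumed form for illustration only and should not be read as the exact PSOM update rules: each sample moves every component in proportion to its posterior probability, with a learning rate that decays to zero. The function name, the single shared rate schedule rate_scale / (t + offset), and the running-average weight update are all assumptions.

```python
import math


def online_mixture_fit(samples, means, variances, weights,
                       rate_scale=2.0, offset=20.0):
    """One pass of a generic on-line (sample-by-sample) normal mixture update."""
    means, variances, weights = list(means), list(variances), list(weights)
    for t, x in enumerate(samples):
        # Posterior ("soft") membership of x in each component.
        joint = [w * math.exp(-(x - m) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        if total == 0.0:
            continue  # x is numerically far from every component; skip it
        posts = [j / total for j in joint]
        rate = rate_scale / (t + offset)  # learning rate decaying to zero
        for k, z in enumerate(posts):
            diff = x - means[k]
            means[k] += rate * z * diff
            variances[k] += rate * z * (diff * diff - variances[k])
            # Running average of the memberships keeps the weights summing to one.
            weights[k] = ((t + offset) * weights[k] + z) / (t + offset + 1.0)
    return means, variances, weights
```

Because every sample is used once and then discarded, nothing needs to be stored beyond the current parameters, which is the property the on-line formulation is after.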

Given the SFNM parameters, i.e., the image components computed by
the ML principle, there are several approaches to perform pixel classification.
When the true pixel labels l_i are considered to be functionally independent
and non-random constants, competitive learning approaches can be used for
the segmentation of different tissue types. We can define the consistency
of discrete relaxation labeling and formalize its relationship to global optimization
as follows: We first define the component in the localized SFNM
distribution (3) as a support function:

Si(k)  =  exp  (9) 

Note  that  the  support  function  Si(k)  is  a  function  of  the  component  (tissue 
type)  k.  Then  tissue  segmentation  is  interpreted  as  the  satisfaction  of  a 
system  of  inequalities: 

$$ S_i(l_i) \geq S_i(k), \qquad (10) $$

for all k and for i = 1, ..., N, where a consistent labeling is defined as the
one having maximum support at each pixel simultaneously. We further
define the average local consistency measure

$$ A(l) = \sum_{i=1}^{N} \sum_{k} I(l_i, k)\, S_i(k) \qquad (11) $$

to link consistent labeling to global optimization. It is shown that when
the spatial compatibility measure is symmetric and A(l) attains a local
maximum at l, then l is a consistent labeling [2, 6, 8]. Hence, a consistent
labeling can be accomplished by locally maximizing A(l). We propose
a probabilistic constraint relaxation network (PCRN) to perform contextual
tissue segmentation by imposing neighborhood context regularities to
alleviate the ambiguity problem. PCRN uses a stochastic discrete gradient
descent procedure where each pixel is randomly visited and its label is
updated, i.e., pixel i is classified into the kth region if

$$ l_i = \arg\min_k \left( \log(\sigma_k^2) - 2\log(\pi_k^{(i)}) + \frac{(x_i - \mu_k)^2}{\sigma_k^2} \right) \qquad (12) $$

where π_k^{(i)} is defined in (2).

Rather than minimizing an energy function, PCRN looks for a possible
local maximum of a global consistency measure. In PCRN the input layer
has a neuron that corresponds to each pixel image and the output layer
has a neuron that corresponds to the label of the original image. Competition
within the hidden layer ensures that only one neuron becomes active at
any pixel location. Gating between the output and the hidden layer incorporates
the local labeling information to provide locally consistent labeling
and hence to remove the ambiguities. Reciprocal feedback from the output to
the gating unit allows each hidden neuron to control its activation. We can
view consistency as a “locking-in” property, i.e., since the support function
defined for a given pixel depends on the current labels at neighboring
pixels, this neighborhood influences the update of the given pixel through
probabilistic compatibility constraints. With the constraint propagation,
the relaxation process iteratively updates the label assignments to increase
the consistency, yielding a labeling that is more consistent with the neighboring labels,
ideally so that each pixel is designated a unique label [2].
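
The per-pixel decision of the relaxation step can be sketched in Python as follows. The sketch assumes the update cost log(σ_k²) - 2 log(π_k^{(i)}) + (x_i - μ_k)² / σ_k², that is, minus twice the log of the locally weighted Gaussian component up to a constant; the function name and the choice to skip labels with zero local prior are assumptions made for illustration.

```python
import math


def relabel_pixel(x, local_priors, means, variances):
    """Return the label k with the smallest contextual cost for gray level x.

    local_priors[k] is the neighbor-based prior of label k at this pixel.
    Labels with zero local prior have infinite cost and are skipped.
    """
    best_k, best_cost = None, math.inf
    for k, (p, m, v) in enumerate(zip(local_priors, means, variances)):
        if p <= 0.0:
            continue  # no neighbor supports label k
        cost = math.log(v) - 2.0 * math.log(p) + (x - m) ** 2 / v
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k
```

With two unit-variance tissues at means 0 and 10, a pixel of value 5.1 goes to the second tissue when both local priors are 0.5, but to the first when seven of its eight neighbors carry the first label. This is the way neighborhood context resolves ambiguous gray levels.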


4. EXPERIMENTAL RESULTS AND DISCUSSIONS

In this section, we present results using the probabilistic neural network
based approach we introduced to quantify and segment tissue types
from real MR brain images. We first present a simulation study to test the
performance of model identification (selection and quantification) with the
proposed criterion (MCBV). We generate test data made up of four overlapping
normal components. Each component represents one local cluster.
The value for each component is set to a constant, and normally
distributed noise is then added to the data. The phantom, the MCBV curve
as a function of the number of local clusters K, and the final distribution
learning are plotted in Figure 1. According to the information theoretic
criteria, the minimum of the curve indicates the correct number of image
components. The result shows that the number of local clusters suggested
by the new criterion is correct and the histogram-model fitting is satisfactory.
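
A simulation of this kind can be sketched as follows. The sketch makes several assumptions that are not taken from the paper: the component values and noise level are arbitrary, the mixture is fitted by plain batch EM rather than PSOM, and an MDL/BIC-style penalty stands in for the MCBV criterion, whose definition is not reproduced here. All function names are hypothetical.

```python
import math
import random

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def simulate(n_per_component, means, std, seed=0):
    """Constant component values plus normally distributed noise."""
    rng = random.Random(seed)
    return [rng.gauss(m, std) for m in means for _ in range(n_per_component)]

def fit_mixture(data, k, n_iters=30, min_std=0.5):
    """Batch EM for a 1-D normal mixture; returns (weights, means, stds, loglik)."""
    ordered = sorted(data)
    n = len(ordered)
    # Initialize each component from one of k equal-size slices of sorted data.
    chunks = [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]
    means = [sum(ch) / len(ch) for ch in chunks]
    stds = [max(min_std, math.sqrt(sum((x - m) ** 2 for x in ch) / len(ch)))
            for ch, m in zip(chunks, means)]
    weights = [1.0 / k] * k
    for _ in range(n_iters):
        r_sum, x_sum, xx_sum = [0.0] * k, [0.0] * k, [0.0] * k
        for x in data:
            # E-step: responsibility of each component for this sample.
            joint = [w * normal_pdf(x, m, s)
                     for w, m, s in zip(weights, means, stds)]
            total = sum(joint) or 1e-300
            for j in range(k):
                r = joint[j] / total
                r_sum[j] += r
                x_sum[j] += r * x
                xx_sum[j] += r * x * x
        # M-step: re-estimate weights, means, and standard deviations.
        for j in range(k):
            rs = max(r_sum[j], 1e-12)
            weights[j] = rs / n
            means[j] = x_sum[j] / rs
            var = xx_sum[j] / rs - means[j] ** 2
            stds[j] = max(min_std, math.sqrt(max(var, 0.0)))
    loglik = sum(math.log(max(1e-300, sum(
        w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds))))
        for x in data)
    return weights, means, stds, loglik

def select_order(data, k_min, k_max):
    """Pick the K that minimizes the stand-in criterion (not MCBV)."""
    n = len(data)
    scores = {}
    for k in range(k_min, k_max + 1):
        loglik = fit_mixture(data, k)[3]
        n_params = 3 * k - 1  # k means, k stds, k - 1 free weights
        scores[k] = -loglik + 0.5 * n_params * math.log(n)
    return min(scores, key=scores.get), scores

data = simulate(300, means=[40.0, 90.0, 140.0, 190.0], std=10.0)
best_k, scores = select_order(data, 1, 6)
```

Plotting `scores` against K gives a curve of the same kind as the criterion curve in Figure 1, with the selected order at its minimum.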

Figure 1: Experimental results of model selection and final quantification
on the simulated image.

For the real MR brain image, the information theoretic criterion is first
applied to detect the number of tissue types, thus allowing the corresponding
network to adapt its structure for the best representation of the data. The
PSOM algorithm is used to quantify the parameters of the tissue types, leading
to an ML estimation. Segmentation of the identified tissue components is
then implemented by PCRN through contextual Bayesian decision. Figure
2 (a) shows the original data consisting of pure brain tissues, a T1-weighted
image parallel to the AC-PC line, acquired with a GE Signa 1.5 Tesla system.
The imaging parameters are TR 35, TE 5, flip angle 45°, 1.5 mm
effective slice thickness. The corresponding histogram is given in Figure 2
(b); it has considerably complex characteristics since the tissue types
are all highly overlapping. Evaluation of different image analysis techniques
is a particularly difficult task, and the dependability of evaluations by simple
mathematical measures is largely in question. The quality of the quantified
and segmented image usually depends heavily on subjective and
qualitative judgements. In this study, besides the evaluation performed by
radiologists, we use the global relative entropy (GRE) value to reflect the
quality of tissue quantification; for assessment of tissue segmentation,
we use post-segmentation sample averages as an indirect but objective criterion.
As discussed in the literature, the brain is generally composed of
three principal tissue types, i.e., WM, GM, CSF, and their pair-wise combinations,
called the partial volume effect. Since the MRI scans clearly show the
distinctive intensities at the local brain areas, the functional tissue types
need to be considered. We let $K_{\min} = 2$ and $K_{\max} = 9$ and calculate
$\mathrm{MCBV}(K)$ (5) for $K = K_{\min}, \ldots, K_{\max}$. The result suggested
that the brain image contains 8 tissue types. When performing the computation of the
information theoretic criteria, we used PSOM to iteratively quantify different
tissue types for each fixed $K$. The result of the final tissue quantification
with $K_0 = 8$ is shown in Figure 2 (b), where a GRE value of 0.02-0.04 nats
is achieved. The PCRN tissue segmentation is then performed; PCRN updates
are terminated after 5-10 iterations since further iterations produced
almost identical results. The segmentation result is shown in Figure 2 (c).
Although the segmentation contains some small isolated spots (less than
4-pixel size), the PCRN approach is quite encouraging. These quantified
tissue types agree with the results of a physician’s qualitative analysis.
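
As an illustration of the GRE measure used above, the following sketch assumes that GRE is the relative entropy (Kullback-Leibler divergence) from the normalized image histogram to the fitted mixture evaluated at the gray levels, with natural logarithms so that the value is in nats. The renormalization of the model over the gray levels and all function names are assumptions of this sketch.

```python
import math

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def sfnm_pmf(levels, weights, means, stds):
    """Normal mixture sampled at the gray levels, renormalized to sum to 1."""
    dens = [sum(w * normal_pdf(u, m, s)
                for w, m, s in zip(weights, means, stds)) for u in levels]
    total = sum(dens)
    return [d / total for d in dens]

def histogram(pixels, n_levels):
    """Normalized histogram of integer gray levels in [0, n_levels)."""
    counts = [0] * n_levels
    for p in pixels:
        counts[p] += 1
    return [c / len(pixels) for c in counts]

def global_relative_entropy(hist, model, floor=1e-300):
    """Sum over gray levels of h(u) * ln(h(u) / f(u)); empty bins contribute 0."""
    return sum(h * math.log(h / max(f, floor))
               for h, f in zip(hist, model) if h > 0)

levels = range(256)
fitted = sfnm_pmf(levels, [0.5, 0.5], [80.0, 160.0], [15.0, 15.0])
shifted = sfnm_pmf(levels, [0.5, 0.5], [100.0, 160.0], [15.0, 15.0])
gre_match = global_relative_entropy(fitted, fitted)      # perfect fit
gre_mismatch = global_relative_entropy(fitted, shifted)  # misplaced component
```

Under this definition a perfect fit gives zero and any mismatch gives a positive value, so a smaller GRE indicates a better histogram-model fit.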

Figure 2: Results of MR brain tissue quantification and segmentation.

We also present a comparison of the performance of PSOM with that
of the EM and the competitive learning (CL) algorithms in MR brain tissue
quantification, to evaluate the computational accuracy and efficiency of
the algorithm in the standard finite normal mixture (SFNM) distribution
learning, based on the objective criterion and learning curves. We applied
all the methods to the same example and used the GRE value between
the image histogram and the estimated SFNM distribution as the goodness
criterion to evaluate the quantification error. Figure 3 (a) shows learning
curves of PSOM and competitive learning (CL), averaged over 5 independent
runs. As observed in the figure, PSOM outperforms CL
by faster convergence and lower quantification error, and reaches a final
GRE value of about 0.04 nats. Figure 3 (b) presents the comparison of
PSOM with the EM algorithm for 25 epochs. As seen in the learning
curves, the PSOM algorithm again shows superior estimation performance.
The final quantification error is about 0.02 nats while preserving the faster
convergence rate.


Figure 3: Learning curves of PSOM and CL (left), and of PSOM and EM (right).
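
For readers unfamiliar with the CL baseline, a learning curve of this kind can be produced by a generic winner-take-all scheme such as the sketch below. It is an assumption-laden illustration rather than the implementation compared in Figure 3: only the class means are learned, the learning rate `lr` is fixed, and the curve records the mean squared quantization error per epoch instead of the GRE value.

```python
import random

def competitive_learning(samples, init_means, lr=0.05, epochs=10, seed=0):
    """Winner-take-all updates of class means; returns (means, learning curve)."""
    rng = random.Random(seed)
    means = list(init_means)
    order = list(samples)
    curve = []
    for _ in range(epochs):
        rng.shuffle(order)  # present the samples in a new random order
        sq_err = 0.0
        for x in order:
            # Winner: the mean closest to the current sample.
            win = min(range(len(means)), key=lambda j: (x - means[j]) ** 2)
            sq_err += (x - means[win]) ** 2
            # Only the winner moves, a fraction lr of the way to the sample.
            means[win] += lr * (x - means[win])
        curve.append(sq_err / len(order))
    return means, curve

# Two hypothetical tissue classes with normally distributed intensities.
rng = random.Random(1)
samples = ([rng.gauss(60.0, 8.0) for _ in range(500)]
           + [rng.gauss(120.0, 8.0) for _ in range(500)])
means, curve = competitive_learning(samples, init_means=[80.0, 100.0])
```

Averaging `curve` over several runs with different seeds gives a curve comparable in spirit to the averaged learning curves described above.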


5.  CONCLUSIONS 

Our main contribution is the complete proposal of a three-step learning
strategy for determination of both the modular structure and the components of
the network. In this approach, the network structure (in terms of suitability
of the statistical model) is justified in the first step. It is followed by soft
segmentation of data such that each data point supports all local components
simultaneously. The associated probabilistic labels are then realized
in the third step by competitive learning of this induced hard classification
task. The main limitations of the current approach are that 1) it requires the
testing of all possible network structure candidates during the model fitting
procedure, and hence is not efficient, especially for processing MR sequence
images where on-line learning is preferred, and 2) applications to real
MR data indicate the possibility of being trapped in a local minimum in
ML estimation by the PSOM, since there is no guarantee of attaining the
global minimum. To summarize, the results of the experiments we have performed
indicate the plausibility of this approach for brain tissue analysis
from MRI scans, and show that it can be applied to clinical problems such
as those encountered in tissue segmentation and quantitative diagnosis.